Scraping and Plotting Minneapolis Property Prices | RSelenium, ggmap, ggplot2

I recall having once scraped data from a Malaysian property site so that I could plot the monthly rental rates for a specific neighborhood in Selangor. This time I thought it might be interesting to try and plot property prices for a specific region in the United States: Minneapolis.
I settled on a site that seemed to show a reasonable amount of information on property sales, and started working on the code. I came up with the following script which ended up giving me about 6000 rows of information:

#load libraries
library(dplyr)
library(rvest)
library(stringr)
library(RSelenium) 

#create a placeholder vector (a single NA) to be populated later
complete = c(NA) 

#From RSelenium
checkForServer() #Check if server file is available
startServer() #Start the server
mybrowser = remoteDriver(browser = "chrome") #Change the browser to chrome
mybrowser$open(silent = TRUE) #Open the browser 

#Link to location sub categories
link_main = "site search results page goes here" 

#Navigate to the site
mybrowser$navigate(link_main) 

#get the web page's html source code
mybrowser$getPageSource()[[1]] %>% html() -> source_main 

#scrape all the sub categories and remove unnecessary words
source_main %>% html_nodes(".property-records-content") %>%
  html_nodes("li") %>% html_nodes("a") %>% html_text() %>%
  str_replace_all(pattern = " For Rent", replacement ="") %>%
  str_replace_all(pattern = " For Sale", replacement ="") %>%
  str_replace_all(pattern = " ", replacement = "-") -> Locations 

#edit locations to be used in URL
paste(Locations, "state code goes here", sep = "") -> Locations 

#first half of the URL to be combined with sub categories
prefix_main = "the site's search results page goes here" 

#there are real estate posts for each sub category.
#start loop for each sub category j.
for(j in 1:length(Locations)){ 

    #start with a provisional page count; it gets updated once the first
    #page has been scraped and the real number of results is known
    total_pages = 10
    i = 1

    #start second loop to go through all result pages in each sub category.
    #a while loop is used (instead of for) so that updating total_pages
    #inside the loop actually changes how many pages get scraped
    while(i <= total_pages){
      #create new URL combining sub category, maximum posts per page, and page number
      link = paste(prefix_main, Locations[j], "/pg-", i, "?pgsz=50", sep = "")
      #navigate to this URL
      try(mybrowser$navigate(link)) 

      #get the html page source for the posts
      try(hLink <- html(mybrowser$getPageSource()[[1]]))
      #extract the link to each post in the search results
      hLink %>% html_nodes("#srp-list") %>%
        html_nodes(".srp-item-ldp-link.hidden-xxs.hidden-xs.hidden-sm.hidden-md") %>%
        html_nodes("a") %>% html_attr("href") -> links
      #add extraction results to empty vector created in the beginning
      complete = c(complete, links)
      #extract the total number of real estate posts in the search results,
      #and assign it to the total number pages to scrape through.
      #previous value of total_pages variable was 10.
      hLink %>% html_nodes(".srp-list-footer") %>%
        html_text() %>% str_trim() %>%
        str_replace_all(pattern = "Found ", replacement = "") %>%
        str_replace_all(pattern = " matching properties", replacement = "") %>%
        str_replace_all(pattern = ",", replacement = "") %>% as.integer() -> total_posts 

      #use ceiling so that a partial final page is still scraped
      ceiling(total_posts/50) -> total_pages 

      i = i + 1

    }
  }
#create a new URL prefix
prefix = "first half of the site goes here"
#links to posts are missing the site prefix,
#so combine the prefix with all the links that were extracted.
paste(prefix, complete, sep = "") -> complete
#remove the first value of the complete vector because
# the first value is NA
complete[2:length(complete)] -> complete 

#create three empty objects: one list and two vectors
#these variables will be populated later
details = list()
prices = c()
title = c() 

#start the loop that will go through all the posts and
#extract all the details for each real estate post
for(i in 1:length(complete)){
    #navigate to each link saved in the complete variable
    try(mybrowser$navigate(complete[i]))
    #get html source of the page
    try(hLink <- html(mybrowser$getPageSource()[[1]]))
    #extract title of the post
    hLink %>% html_nodes(".listing-header") %>%
      html_nodes(".col-xs-7.col-xs-pull-5.col-xxs-12") %>%
      html_nodes("span") %>% html_text() -> title_1
    #collapse the vector holding the title details and..
    paste(title_1, collapse = " - ") -> title_1
    #..populate the empty vector
    title = c(title, title_1) 

    #extract price and populate the empty vector
    hLink %>% html_nodes(".ldp-header-price") %>%
      html_nodes("span") %>% html_text() %>% paste(collapse = "") -> prices_1 

    prices = c(prices, prices_1) 

    #extract the bed, bath, and sqft details of the property and then..
    #..populate the empty list created earlier
    hLink %>% html_nodes(".property-meta.list-horizontal.list-style-disc.list-spaced") %>%
      html_nodes("li") %>%
      html_text() %>%
      unique() %>% str_replace_all(pattern = "\\n" , replacement = "") %>%
      str_replace_all(pattern = "  ", replacement = "")-> details[[i]] 

} 

#create an empty dataframe to be populated later
#the number of rows is the total number of links scraped through
#the number of columns is the number of details needed from each post
matrix(NA, length(complete), 6) %>% data.frame() -> comp_df
#name the columns of the dataframe
names(comp_df) = c("title", "prices", "bed", "bath", "sq_ft", "acres")
#start loop to populate dataframe
for(i in 1:length(complete)){
  #populate the titles
  comp_df[i,"title"] = title[i]
  #populate the prices
  comp_df[i, "prices"] = prices[i]
  #populate beds
  details[[i]] %>% grep(pattern = "bed") -> x
  if(length(x) == 0){comp_df[i, "bed"] = NA}else{comp_df[i, "bed"] = details[[i]][x]}
  #populate baths
  details[[i]] %>% grep(pattern = "bath") -> x
  if(length(x) == 0){comp_df[i, "bath"] = NA}else{comp_df[i, "bath"] = details[[i]][x]}
  #populate sq ft
  details[[i]] %>% grep(pattern = "sq ft") -> x 

      if(length(x) == 0){comp_df[i, "sq_ft"] = NA}else{  

          #some posts also report the lot size in sq ft;
          #drop those entries and keep only the living-space figure
          if(length(x) > 1){  

              grepl("lot", details[[i]][x]) -> lot_log 

              x[!lot_log][1] -> x 

          }

          comp_df[i, "sq_ft"] = details[[i]][x] 

      }
  #populate acres
  details[[i]] %>% grep(pattern = "acre") -> x
  if(length(x) == 0){comp_df[i, "acres"] = NA}else{comp_df[i, "acres"] = details[[i]][x]} 

}
#export final data frame
write.csv(comp_df, "DF_unclean.csv", row.names = FALSE) 

The resulting data frame holds the following fields:
Title: The title of the post which shows the region, street address, and zip code of where the property is located.
Prices: The asking price of the property.
Bed: Number of beds
Bath: Number of baths, both full and half.
Sq_ft: The space of the property, in square feet.
Acres: The total number of acres for the whole lot in which the property is situated.
Obviously, the data is not tidy enough to be plotted…so now comes the cleaning:

library(dplyr)
library(stringr) 

comp_df -> comp 

#remove Prefix from titles
comp[,1] %>% str_replace_all(pattern = "Home For Sale - ", replacement = "") -> titles
comp[,1] = titles 

#remove dollar signs and commas
comp[,2] %>%
  str_replace_all(pattern = "\\$", replacement = "") %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.integer() -> new_prices 

comp[,2] = new_prices 

#remove characters and convert to factor
comp[,3] %>%
  str_replace_all(pattern = "beds", replacement = "") %>%
  str_replace_all(pattern = "bed", replacement = "") %>%
  as.factor() -> beds 

comp[,3] = beds 

comp[,4] %>%
  str_replace_all(pattern = "fullbaths", replacement = "FULL") %>%
  str_replace_all(pattern = "fullbath", replacement = "FULL") %>%
  str_replace_all(pattern = "full", replacement = "FULL") %>%
  str_replace_all(pattern = "halfbaths", replacement = "HALF") %>%
  str_replace_all(pattern = "halfbath", replacement = "HALF") %>%
  str_replace_all(pattern = ",", replacement = " - ") %>%
  str_replace_all(pattern = "baths", replacement = "FULL") %>%
  str_replace_all(pattern = "bath", replacement = "FULL") %>%
  as.character() -> baths 

comp[,4] = baths 

#square feet
comp[,5] %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  str_replace_all(pattern = " sq ft", replacement = "") %>%
  as.integer() -> sq_ft 

comp[,5] = sq_ft 

#acres
comp[,6] %>%
  str_replace_all(pattern = " acres lot", replacement = "") %>%
  as.numeric() -> acres_lot 

comp[,6] = acres_lot 

#split full and half baths
comp$full_baths = NA
comp$half_baths = NA 

for(i in 1:nrow(comp)){ 

  if(grepl(" - ", as.character(comp[i,4]))){ 

    regexpr(" - ", as.character(comp[i,4]))[[1]][1] -> start 

    substr(comp[i,4], start+3, nchar(as.character(comp[i,4]))) -> half_b
    substr(comp[i,4], 0, start - 1) -> full_b 

    comp$half_baths[i] = half_b 

    comp$full_baths[i] = full_b 

  }else{  

    comp$full_baths[i] = comp[i,4] 

    }
} 

#split title into zip code, location, and street address
comp$zip = NA
comp$location = NA
comp$strt_address = NA 

for(i in 1:nrow(comp)){ 

  substr(comp[i,1], nchar(comp[i,1]) - 4, nchar(comp[i,1])) %>%
    as.integer() -> comp$zip[i] 

  gregexpr(" - ", as.character(comp[i,1]))[[1]][1][1] -> start
  gregexpr(" - ", as.character(comp[i,1]))[[1]][2] -> end 

  substr(comp[i,1], start + 3, end - 1) -> Location 

  comp$location[i] = Location 

  substr(comp[i,1], 0, start - 1) -> street 

  comp$strt_address[i] = street 

} 

select(comp, location, strt_address, zip, bed, full_baths, half_baths, sq_ft, prices) -> Final_Complete 

write.csv(Final_Complete, "Final_Complete_NEW.csv", row.names = FALSE) 

#______________________________________________________________________________________ 

unique(Final_Complete[, "zip"]) %>% write.csv(file = "unique_zip.txt", row.names = FALSE) 

read.csv("Zip_Coords.txt", sep = ",") -> zip_coords 

#______________________________________________________________________________________ 

Final_Complete$long = NA
Final_Complete$lat = NA 

zip_coords$name %>%
  str_replace_all(pattern = ", United States", replacement = "") -> zip_coords$name 

#______________________________________________________________________________________ 

for(i in 1:nrow(Final_Complete)){ 

  for(j in 1:nrow(zip_coords)){ 

    if(as.character(zip_coords$name[j]) == as.character(Final_Complete$zip[i])){  

      Final_Complete$long[i] = zip_coords$longitude[j]
      Final_Complete$lat[i] = zip_coords$latitude[j]
      break 

      } 

  } 

} 

write.csv(Final_Complete, "Final_Complete_NEW.csv", row.names = FALSE)

First, I split the post title into three separate columns: the location, the street address, and the zip code. For the prices, I removed the dollar sign and the comma separating the thousands and converted the result to an integer. For the bed column, I removed the text and left only the numbers. Finally, I split the bath column into two columns: one for the number of full baths, and the second for the number of half baths.
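
As a side note, the title split could be written more compactly with tidyr. This is only a sketch of the idea, not the code used above; it assumes the titles keep the "street - location - city/state/zip" pattern just described, and the column names simply mirror the ones used earlier:

library(dplyr)
library(tidyr)
library(stringr) 

#separate() splits the title on " - "; the zip is then pulled out of the
#last piece, much like the substr() loop above does
comp %>%
  separate(title, into = c("strt_address", "location", "city_state_zip"),
           sep = " - ", remove = FALSE) %>%
  mutate(zip = as.integer(str_sub(city_state_zip, -5, -1))) -> comp_split 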

I also would have liked coordinate data to have been available for each property, but I couldn’t find it in the site’s html source code. I did notice, however, that I have the zip codes for each post. So after a bit of googling, I came across a site called GPS Visualizer which basically converts any address into longitude/latitude coordinate data. What’s more, it allows you to do this for a group of zip codes, provided that you supply your API key for whatever source you choose.
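
(As an aside, a similar lookup can also be done from inside R with ggmap's geocode() function; a minimal sketch, using made-up example zip codes and subject to the geocoding service's rate limits:)

library(ggmap) 

#hypothetical example zips; in practice these would come from Final_Complete$zip
zips = c("55401", "55402", "55403") 

#geocode() returns a data frame with lon and lat columns, one row per query
geocode(paste(zips, ", MN, USA", sep = "")) -> coords 

data.frame(name = zips, longitude = coords$lon, latitude = coords$lat) -> zip_coords 
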
After exporting the unique zip codes in my cleaned dataframe, I pasted the information into the site and got my coordinate data. I then loaded this coordinate data back into R as a CSV file and combined it with my cleaned data frame, adding two new columns named long and lat, for longitude and latitude respectively.
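
For reference, the nested matching loop shown earlier could also be written as a single join; a rough sketch, assuming zip_coords has the name/longitude/latitude columns used above and that the long/lat placeholder columns haven't been added yet:

library(dplyr) 

#turn the zip string in zip_coords$name into an integer key and join on it
zip_coords %>%
  mutate(zip = as.integer(as.character(name))) %>%
  select(zip, long = longitude, lat = latitude) -> zip_lookup 

left_join(Final_Complete, zip_lookup, by = "zip") -> Final_Complete 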

Although I only intended to scrape data that is limited to Minneapolis, I did get some other locations during the extraction. I’m not sure why, or perhaps they are indeed all situated in Minneapolis and this is just evidence of my ignorance of US cities. In any case, I’d be interested to see how many of the posts are not labeled as being situated in Minneapolis.

#load the plotting packages (scales provides the comma label formatting used below)
library(ggplot2)
library(scales) 

#df is the cleaned data frame from the previous step
df -> df_plot 

select(df_plot, location) %>%
  group_by(location) %>% summarise(Count = length(location)) -> df_plot 

ggplot(df_plot, aes(x = reorder(location, Count), y = Count)) +
  geom_bar(stat = "identity", aes(fill=Count)) +  

  labs(x = "Location", y = "Number of Posts",
       title = "Number of Posts by Location") +  

  theme(axis.title.x = element_text(face="bold", size=14),
        axis.title.y = element_text(face="bold", size=14),
        axis.text.x = element_text(face="bold", colour="black"),
        axis.text.y = element_text(face="bold", colour="black"),
        plot.title = element_text(face="bold", colour="black")) +  

  coord_flip() 

#______________________________________________________________________ 

df -> df_plot 

select(df_plot, zip) %>%
  group_by(zip) %>% summarise(Count = length(zip)) -> df_plot 

ggplot(df_plot, aes(x = reorder(zip, Count), y = Count)) +
  geom_bar(stat = "identity", aes(fill=Count)) +  

  labs(x = "Zip Code", y = "Number of Posts",
       title = "Number of Posts by Zip") +  

  theme(axis.title.x = element_text(face="bold", size=14),
        axis.title.y = element_text(face="bold", size=14),
        axis.text.x = element_text(face="bold", colour="black"),
        axis.text.y = element_text(face="bold", colour="black"),
        plot.title = element_text(face="bold", colour="black"),
        legend.position = "none") +  

  coord_flip()

[Plots: Number of Posts by Location; Number of Posts by Zip]
That zip code graph looks a little noisy. Perhaps it would look a little better plotted on a map.

 

library(ggmap)
library(ggplot2) 

MN = c(min(Final_Complete$long), min(Final_Complete$lat), max(Final_Complete$long), max(Final_Complete$lat)) 

get_map(location = MN, source = "google", maptype = "roadmap", crop = FALSE, color = "bw") -> MN_map 

ggmap(MN_map) +
  stat_density2d(data=df, aes(x = long, y = lat,
                                          fill = ..level..,
                                          alpha = ..level..), geom = "polygon") +  

  scale_fill_gradient(low = "#3BE819", high = "#B5170B") +  

  theme(axis.text=element_blank(), panel.grid=element_blank(), axis.title=element_blank(),
        axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), legend.position="none",
        panel.background=element_blank(), plot.title = element_text(face="bold", colour="black")) +  

  labs(title = "Density plot by zip code") +  

  scale_alpha_continuous(range=c(0.1,0.4)) 

[Plot: Density plot by zip code]
I think it’ll also be very interesting to know what the average price of a property is for each location and zip code.

 

df -> df_plot 

filter(df_plot, prices <= 2000000) -> df_plot 

ggplot(df_plot, aes(x = reorder(location,prices),
                    y = prices)) + geom_boxplot(fill = "orange") +  

  labs(x = "Location", y = "Prices",
       title = "Boxplot: Range and Average House Prices by Location") +  

  theme(axis.title.x = element_text(face="bold", size=14),
        axis.title.y = element_text(face="bold", size=14),
        axis.text.x = element_text(face="bold", colour="black", angle = 30),
        axis.text.y = element_text(face="bold", colour="black")) +  

  scale_y_continuous(labels = comma) 

#__________________________________________________________________ 

df -> df_plot 

filter(df_plot, prices <= 2000000) -> df_plot 

ggplot(df_plot, aes(x = reorder(zip,prices),
                    y = as.numeric(prices))) + geom_boxplot(fill = "green") +  

  labs(x = "Zip Code", y = "Prices",
       title = "Boxplot: Range and Average House Prices by Zip Code") +  

  theme(axis.title.x = element_text(face="bold", size=14),
        axis.title.y = element_text(face="bold", size=14),
        axis.text.x = element_text(face="bold", colour="black", angle = 45),
        axis.text.y = element_text(face="bold", colour="black")) +  

 scale_y_continuous(labels = comma)

[Plots: Range and average house prices by Location; by Zip Code]
Finally, it’ll be interesting to see how the plots look when using multiple dimensions.

 

df -> df_plot 

df_plot$bed = as.integer(df_plot$bed) 

filter(df_plot, prices >= 150000, prices <= 300000,
       bed >= 4, bed <= 6, full_baths != "4 FULL",
       full_baths != "5 FULL") -> df_plot 

df_plot$bed = as.factor(df_plot$bed) 

ggplot(df_plot, aes(x = reorder(location,prices),
                    y = prices)) + geom_boxplot(aes(fill = bed)) +  

  labs(x = "Location", y = "Prices",
       title = "Boxplot by Location: Prices VS No. of bedrooms VS No. of Full Bathrooms") +  

  theme(axis.title.x = element_text(face="bold", size=14),
        axis.title.y = element_text(face="bold", size=14),
        axis.text.x = element_text(face="bold", colour="black", angle = 30),
        axis.text.y = element_text(face="bold", colour="black")) +  

  scale_y_continuous(labels = comma) + facet_grid(~full_baths) + coord_flip()

[Plot: Boxplot by Location – Prices vs. No. of bedrooms vs. No. of full bathrooms]

There are plenty of things I could do with this data, especially if you consider the machine learning possibilities. If one could get a hold of the coordinates of all the schools in each zip code, and perhaps the crime rates in each location, you could come up with a reasonable model that estimates the price of a property from a given set of variables.
But then again, machine learning has never been something I excelled at. I’ve always been interested in how to accurately scrape and clean the data from a given site. Perhaps one day I’ll come up with my very own model instead of just asking others to try it. Till then, I’ll leave it to you guys.
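
Just to illustrate the idea, even the variables already scraped could feed a crude baseline; a minimal sketch (not a serious model), assuming the cleaned Final_Complete data frame from above:

library(dplyr) 

#drop rows with missing price or size, then fit a simple linear baseline;
#school coordinates or crime rates could be joined in by zip code later
model_df = filter(Final_Complete, !is.na(prices), !is.na(sq_ft), !is.na(bed)) 

price_model = lm(prices ~ sq_ft + as.integer(as.character(bed)) + location, data = model_df) 

summary(price_model) 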


Swedish Real Estate Prices | rvest, RSelenium, ggplot2

Finally got myself an internet connection at home, so I can start another scraping project. This time I thought it would be nice to try a non-English site that also has coordinate data in each post. I settled on a site and started cracking.

As always, I began by inspecting the site’s html source code. I was hoping that the site didn’t incorporate any javascript, but unfortunately it does. That means I’m going to have to use RSelenium again in order to simulate some kind of human interaction. In this case, I need it to click the Next button in the search results so that I can scrape the next set of results. One issue I wasn’t expecting was the coordinate data being mushed together with other information I didn’t need, but that’s nothing the stringr package can’t fix.

Since most of the information for each real estate posting is included in the post itself and not in the search results, the first loop would need to get all the links, while the second loop would have to actually navigate to each of these links and retrieve all the information available.

The code I used for the scraping is below:

library(dplyr)
library(RSelenium)
library(rvest)
library(stringr)
library(beepr) 

link = "Link to first page of results goes here"
page = 1
link = paste(link, as.character(page), sep ="") 

checkForServer() #Check if server file is available
startServer() #Start the server
mybrowser = remoteDriver(browser = "chrome") #Change the browser to chrome
mybrowser$open(silent = TRUE) #Open the browser 

mybrowser$navigate(link) #Navigate to URL 

#find button and click
wxbutton = mybrowser$findElement(using = 'css selector', "div.map-toolbar__item.map-toolbar__item--fullscreen.tooltip")
wxbutton$clickElement() 

#get the source
mybrowser$getPageSource()[[1]] -> source 

newlist = c("ä", "ö", "ü", "Ä", "Ö", "Ü", "ß")
newlist_2 = c("a", "o", "u", "A", "O", "U", "B") 

swedChar = function(source, newlist, newlist_2){ 

  for(i in 1: length(newlist)){ 

    source %>% str_replace_all(pattern = newlist[i], replacement = newlist_2[i]) -> source 

  }
  return(source)
} 

swedChar(source, newlist, newlist_2) -> source 

html(source) -> source 

source %>% html_nodes("#hitlists") %>% html_nodes("#hitlist") %>%
  html_nodes("a.hit__anchor.js__hit__anchor") %>% html_attr("href") -> Links #links 

source %>% html_nodes("#hitlists") %>% html_nodes("#hitlist") %>%
  html_nodes("a.hit__anchor.js__hit__anchor") %>% html_attr("data-meta") -> meta #meta with long/lat coord 

meta %>% str_replace_all(pattern = "\"", replacement = "") %>%
  str_replace_all(pattern = ":", replacement = "") -> meta #remove quotation marks and colons 

meta %>% str_locate("longitude") -> x #character location of "longitude" 

x[,1] %>% as.integer() -> start #where "longitude" begins
x[,2] %>% as.integer() -> end #where "longitude" ends 

#locate end of "longitude" and take next 10 characters, then convert numeric
substr(meta, end+1, end+10) %>% as.numeric() -> Longitudes 

#do same for "latitude"
meta %>% str_locate("latitude") -> x 

x[,1] %>% as.integer() -> start
x[,2] %>% as.integer() -> end 

substr(meta, end+1, end+10) %>% as.numeric() -> Latitudes 

data.frame(Links, meta, Latitudes, Longitudes) -> df 

for(i in 2:710){ #710 calculated based on total search results 

  link = "Link goes here again"
  page = i
  link = paste(link, as.character(page), sep ="") 

  wxbutton = mybrowser$findElement(using = 'css selector', "div.map-toolbar__item.map-toolbar__item--fullscreen.tooltip")
  wxbutton$clickElement() 

  mybrowser$navigate(link)
  Sys.sleep(5) 

  mybrowser$getPageSource()[[1]] -> source 

  swedChar(source, newlist, newlist_2) -> source 

  html(source) -> source 

  source %>% html_nodes("#hitlists") %>% html_nodes("#hitlist") %>%
    html_nodes("a.hit__anchor.js__hit__anchor") %>% html_attr("href") -> Links #links 

  source %>% html_nodes("#hitlists") %>% html_nodes("#hitlist") %>%
    html_nodes("a.hit__anchor.js__hit__anchor") %>% html_attr("data-meta") -> meta #meta with long/lat coord 

  meta %>% str_replace_all(pattern = "\"", replacement = "") %>%
    str_replace_all(pattern = ":", replacement = "") -> meta #remove quotation marks and colons 

  meta %>% str_locate("longitude") -> x #character location of "longitude" 

  x[,1] %>% as.integer() -> start #where "longitude" begins
  x[,2] %>% as.integer() -> end #where "longitude" ends 

  #locate end of "longitude" and take next 10 characters, then convert numeric
  substr(meta, end+1, end+10) %>% as.numeric() -> Longitudes 

  #do same for "latitude"
  meta %>% str_locate("latitude") -> x 

  x[,1] %>% as.integer() -> start
  x[,2] %>% as.integer() -> end 

  substr(meta, end+1, end+10) %>% as.numeric() -> Latitudes 

  data.frame(Links, meta, Latitudes, Longitudes) -> df_2 

  rbind(df, df_2) -> df 

} 

beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
write.csv(df, "Final_File.csv", row.names = FALSE) 

#______________Create list with property details________________________ 

fURL = "URL to the site's homepage" 

link = paste(fURL, df[1,1], sep = "") 

mybrowser$navigate(link) 

mybrowser$getPageSource()[[1]] -> source 

swedChar(source, newlist, newlist_2) -> source 

html(source) -> source 

wxbutton = mybrowser$findElement(using = 'css selector', 'article.property__simple-number.property__show-more')
wxbutton$clickElement() 

source %>% html_nodes(".decks") %>%
  html_nodes(".property__detailed-numbers") %>% html_nodes(".property__table-cell") %>%
  html_text() %>% str_trim() -> details 

list(details) -> details 

for(i in 1:nrow(df)){ 

  fURL = "Home page URL again" 

  link = paste(fURL, df[i,1], sep = "") 

  mybrowser$navigate(link) 

  try(mybrowser$getPageSource()[[1]] -> source) 

  swedChar(source, newlist, newlist_2) -> source 

  try(html(source) -> source) 

  try(wxbutton <- mybrowser$findElement(using = 'css selector', 'article.property__simple-number.property__show-more'))
  try(wxbutton$clickElement()) 

  try(source %>% html_nodes(".decks") %>%
    html_nodes(".property__detailed-numbers") %>% html_nodes(".property__table-cell") %>%
    html_text() %>% str_trim() -> details_1) 

  #store this post's details in the list
  details_1 -> details[[i]] 

  print(i)
} 

#close browser and server
mybrowser$close()
mybrowser$closeServer() 

#End of loop notification
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2) 

#_______________Get unique header names___________________________ 

seq(1,length(details[[1]]), by = 2) -> header_labels 

details[[1]][header_labels] -> headers 

for(i in 2:length(details)){ 

  try(seq(1,length(details[[i]]), by = 2) -> header_labels) 

  try(details[[i]][header_labels] -> headers_1) 

  try(headers = c(headers, headers_1)) 

} 

headers %>% unique() %>% na.omit() %>% as.character() -> full_headers
#______________Create data frame with header names__________________ 

matrix(data = NA, nrow = nrow(df), ncol = length(full_headers)) %>%
  data.frame() -> Final_df 

names(Final_df) = full_headers 

#______________Populate data frame___________________________________ 

for(i in 1:length(details)){ 

  for(j in 1:length(full_headers)){ 

    v_header = details[[i]] 

    sum(v_header == full_headers[j]) -> check 

    if(check != 0){  

      value = which(v_header == full_headers[j]) 

      if(length(value) > 1){value = value[1]} 

      value = value + 1 

      Final_df[i,full_headers[j]] = v_header[value] 

      } 

  } 

} 

#_______________Export Final File____________________________________
write.csv(Final_df, "Final_Details_File.csv", row.names = FALSE) 

I had a tough time trying to understand what the hell “html encoding” is. Apparently, you can’t just scrape a foreign-language website the way you would one that’s in English. RSelenium returns the page source as class character, so I just replaced all the Swedish special characters with English ones, and only then did I use rvest to extract the information that I needed. By the end of the script, I was left with 20 variables; some of which I don’t understand, and some of which are pretty straightforward. All variable names were translated from their original Swedish, and since I don’t know the first thing about Swedish, I just used Google Translate. The variables are:

1. Link: The link to the actual post
2. meta: The meta data that includes the longitudes and latitudes, among other things.
3. Latitudes: The latitudes isolated from the meta data.
4. Longitudes: Same as the latitudes, but for the longitudes.
5. Property_Type: Pretty much self explanatory. It ranges from “Apartment” to “Plot_Land”.
6. Rooms: The number of rooms in the property.
7. Living_Area: The figures are in square meters, and relate to only the living space in the property.
8. Flat_Floor: If the property is on a floor, the floor number.
9. Page_Views
10. Mobile_Coverage: I’m guessing this means the telecommunications company providing service in the area.
11. Estimate: The estimated price of the property, in SEK.
12. Estimate_Per_SqM: The price per square meter of the living area.
13. Fee
14. Operating_Cost
15. Year_Built: The year the property was constructed.
16. Distance_from_Water: I was honestly curious about this one. Probably means the distance from a lake or an ocean…?
17. Plot: This is the total area of the property as a whole, not to be confused with the Living_Area variable. This is also in square meters.
18. Agency: The real estate agency who are arranging the transactions.
19. Price_Change: Any changes to the estimate that was originally posted. Currency is in SEK.
20. Date_Modified: The date on which the post was modified.

There was one more variable that I managed to extract but couldn’t translate from the original Swedish, so I just excluded it. I had to do all the translations manually in R when I was cleaning the data. The final data frame has roughly 24,000 rows of data. The script I used for the data cleaning/tidying is below:

library(dplyr)
library(RSelenium)
library(rvest)
library(stringr)
library(beepr) 

Final_Clean = Final_df 

head(Final_Clean) 

#_________________Rename Columns__________________________________________ 

new_headers = c("Property_Type", "Rooms", "Living_Area", "Flat_Floor_Number",
                 "Page_Views", "Mobile_Coverage", "Estimate", "Est_Sq_Met", "Fee",
                 "Operating_Cost", "Year_Built", "Distance_To_Water", "Plot",
                 "Biarea", "Agency", "Price_Change", "Modification_Date") 

names(Final_Clean) = new_headers 

#___________Separate All NA Rows_________________________________ 

NAs = c()
x = 0
for(i in 1:nrow(Final_Clean)){ 

  Final_Clean[i,] %>% is.na() %>% sum() -> Sum_NAs 

  if(Sum_NAs == ncol(Final_Clean)){  

    x = length(NAs) + 1 

    NAs[x] = i} 

} 

cbind(df, Final_Clean) -> Final_Clean 

Final_Clean[NAs,] -> Final_NAs 

Final_Clean[-NAs,] -> Final_Clean 

#_________________Property  Type________________________________________________ 

unique(Final_Clean$Property_Type) -> Types 

replacements = c("Lagenhet" = "Apartment", "Villa" = "Villa", "Radhus" = "Townhouse", "Tomt/Mark" = "Plot_Land",
                  "Fritidshus" = "Vacation_Home", "Kedjehus" = "Detached_House", "GÃ¥rd" = "Farm", "Parhus" = "Townhouse",
                  "Ovrigt" = "Other") 

replacements[Final_Clean$Property_Type] -> Final_Clean$Property_Type 

#________________Rooms____________________________________________________________ 

unique(Final_Clean$Rooms) %>%
  na.omit() %>% as.character() -> Types 

grepl("½", Types) -> log_Types 

str_replace_all(Types, "½", replacement = "") %>%
  str_replace_all(pattern = "rum", replacement = "") %>%
  as.integer() -> Types_int 

Types_int[log_Types] + 0.5 -> Types_int[log_Types] 

setNames(Types_int, Types) %>% as.table() -> replacements 

replacements[Final_Clean$Rooms] %>%
  as.numeric() -> Final_Clean$Rooms 

#_______________Remove unwanted characters________________________________________ 

c("m²", " tr", " st", " kr/m²", " kr/mån", " kr", " ", "/") -> replacements 

for(i in 5:ncol(Final_Clean)){ 

  for(j in 1:nrow(Final_Clean)){ 

    for(k in 1:length(replacements)){ 

      str_replace_all(Final_Clean[j,i],
                      pattern = replacements[k],
                      replacement = "") -> Final_Clean[j,i]
    } 

  } 

} 

beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2) 

#_______________Standardize measurements for Distance to water______________________ 

Final_Clean$Distance_To_Water %>%
  str_replace_all(pattern = ",", replacement = "") -> Final_Clean$Distance_To_Water 

grepl("km", Final_Clean$Distance_To_Water) -> loc_km 

Final_Clean$Distance_To_Water %>%
  str_replace_all(pattern = "km", replacement = "") %>%
  str_replace_all(pattern = "m", replacement = "") %>%
  as.numeric() -> figures 

Final_Clean$Distance_To_Water = figures 

Final_Clean$Distance_To_Water[!loc_km] = figures[!loc_km]/1000 

#_______________changes to appropriate classes_______________________________________ 

names(Final_Clean) %>% as.data.frame() 

for(i in c(3, 4,11, 12, 13, 14, 20)){ 

  Final_Clean[,i] = as.numeric(Final_Clean[,i]) 

} 

for(i in c(6, 7, 8, 9, 15, 17)){ 

  Final_Clean[,i] = as.integer(Final_Clean[,i]) 

} 

for(i in c(5, 10, 19)){ 

  Final_Clean[,i] = as.factor(Final_Clean[,i]) 

} 

#check
str(Final_Clean)

Alright, now that I have the data, it’s time to somehow get hold of a map of Sweden and plot the data on it.

I hate this part because, other than my previous post about plotting Manhattan rentals, I don’t know the first thing about geographic files/data.

It took me a while to find a reasonably simple file. The first file I got was from the European Environment Agency. The file was just too gigantic for me to load into R. I’m embarrassed to say that my feeble desktop PC actually crashed trying to load it. I tried on my laptop, and it quite literally took an entire day for the load process to finish.

Luckily, I found the one that was put up on the Global Administrative Areas (GADM) site. You just need to select the country and then pick one of the download links provided.
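
For anyone following along, loading one of those GADM downloads looks roughly like this; a sketch only, assuming the R (SpatialPolygonsDataFrame) download for Sweden at level 0, with a hypothetical file name that depends on the GADM version:

library(sp)
library(ggplot2) 

#hypothetical file name; GADM ships the boundaries as an .rds SpatialPolygonsDataFrame
readRDS("SWE_adm0.rds") -> sweden 

#fortify() turns the polygons into a data frame that ggplot2 can draw
fortify(sweden) -> sweden_map 

ggplot() +
  geom_polygon(data = sweden_map, aes(x = long, y = lat, group = group),
               fill = "grey85", colour = "grey40") 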

Now that I have all the information I need, time for some plots.

[Plots: post locations on the map of Sweden; post density by Property Type]

Yeah. That second plot doesn’t look as insightful as I was hoping it would be. It looks like the Apartment level has distorted the scale of the density; I’m guessing that’s why the other property types all show light shades of green and no red.
It’s probably more useful to check the average prices of each Property Type…
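
(The exact code behind the screenshot isn't shown here; a sketch of how such a summary could be produced from Final_Clean:)

library(dplyr)
library(ggplot2) 

Final_Clean %>%
  group_by(Property_Type) %>%
  summarise(Avg_Estimate = mean(Estimate, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(Property_Type, Avg_Estimate), y = Avg_Estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(x = "Property Type", y = "Average Estimate (SEK)") 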

[Plot: average Estimate by Property Type]
I expected the Villa to have the highest average, but to be honest…I don’t know the difference between a Townhouse and a Villa. I limited the dataset by filtering to Estimates of SEK 1M or less and 5 Rooms or fewer. The Villa category seems to be the only Property Type that shows a logical progression of Estimate vs. number of Rooms. I couldn’t figure out how to map the year in which the property was built, so instead I tried to plot the locations of all the posts and map the colors to how expensive each property is.

[Plot: post locations, coloured by Estimate]

Yeah. That, also, was not as visually pleasing as I thought it would be. Plus, I couldn’t figure out how to plot the Rooms variable somehow.
Anywho, there’s not much more I can do with all this data, mainly because my R game is not as good as I would like it to be – it still needs an awful lot of work. But if anyone out there would like to get their hands on the data, let me know and I’ll send you a copy.


Navigating & Scraping a Job Site | rvest & RSelenium

One of my family members gave me the idea to try scraping data from a job site and arranging the data in a way that can then easily be filtered and checked using a spreadsheet. I’m actually a little embarrassed that I didn’t think of this idea myself. Needless to say, I was anxious to try it out.

I picked a site and started inspecting the HTML code to see how I would get the information I needed from each job posting. Normally, the easiest scrapes (for me) are the ones where the site is structured with two characteristics.

First, it helps if all (or at least most) of the information that I need to extract is on the site’s search results page. For instance, in the context of job postings, if you search for “Data Scientist” and the search results show the job title, the company that’s hiring, the years of experience required, the location, and a short summary – then there is no real need to navigate to each post and get that data from the post itself.

The second characteristic is whether the URL of the search results shows the page number that you are currently on – or gives any indication of which search result number I am looking at. For instance, google “Data Scientist” and take note of the URL. Scroll down and click the second page, and notice that the URL now ends with “start=10”. Go to the third page and you’ll notice that it now ends with “start=20”. Although it doesn’t mention which page, it does indicate that if you were to change those last two digits to anything (go ahead and try), the search results would begin from start + 1; i.e. if start = 10, the search results would begin with search result no. 11. If I’m lucky, some websites have clear indications in the URL, like “page=2”, which makes the task even easier.

Now why would these two characteristics make things much easier? Mainly because you can split the URL into different parts, with only one variable – the page number – and then concatenate the parts back together. After that, it’s just a matter of looping through these URLs and picking up the information you need from the HTML source.
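
To make that concrete, here is a minimal sketch of the idea using a made-up results URL (not the site used in this post), where the only moving part is the start value:

library(rvest) 

#made-up URL; only the trailing start= value changes from page to page
prefix = "http://www.example.com/jobs?q=data+scientist&start=" 

for(pg in 0:4){ 

  link = paste(prefix, pg * 10, sep = "") #start = 0, 10, 20, ...

  #html(link) %>% html_nodes(".job-title") %>% html_text() -> titles

  print(link) 

} 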

If the above two characteristics exist, all I need is the rvest package to make it all work, with dplyr and stringr for some of the “tidying”.

There are certain instances, however, when both of these characteristics are missing. It’s usually because the site incorporates some javascript, so the URL does not change when going through different search pages. This means that in order to make this work, I would actually have to click the page buttons to get the HTML source – and I can’t do that with rvest.

Enter RSelenium. The wonderful R package that allows me to do all that.

As always, I started off by loading the packages, assigning the URL for the search results page, and extracting the data for just the first page. You’ll have to excuse me for using the “=” operator. WordPress seems to screw up the formatting if I use the “less than” operator combined with a hyphen, which is sort of annoying.

#Load packages
library(dplyr)
library(rvest)
library(stringr)
library(RSelenium)
library(beepr)

#Manually paste the URL for the search results here
link = "jobsite search results URL here"

#Get the html source of the URL
hlink = html(link)

#Extract Job Title
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjob_item") %>% html_nodes(".tpjob_title") %>% 
  html_text() %>% data.frame(stringsAsFactors = FALSE) -> a
names(a) = "Title"

#Extract Recruitment Company
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjobwrap") %>% html_nodes(".tpjob_cname") %>% html_text() %>% 
  data.frame(stringsAsFactors = FALSE) -> b
names(b) = "Company"

#Extract Links to Job postings
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjob_item") %>% html_nodes(".tpjob_lnk") %>% 
  html_attr("href") %>% data.frame(stringsAsFactors = FALSE) -> c
names(c) = "Links"

At this point I’ve only extracted the job titles, the hiring company’s name, and the link to each post. In order to get the same details for the remaining posts, I would need to first navigate to the next page, which involves clicking the Next button at the bottom of the search results page.


#From RSelenium
checkForServer() #Check if server file is available
startServer() #Start the server
mybrowser = remoteDriver(browser = "chrome") #Change the browser to chrome
mybrowser$open(silent = TRUE) #Open the browser
Sys.sleep((5)) #Wait a few seconds
mybrowser$navigate(link) #Navigate to URL
Sys.sleep(5) 

Pages = 16 #Select how many pages to go through

for(i in 1:Pages){ 
  
  #Find the "Next" button and click it
  try(wxbutton <- mybrowser$findElement(using = 'css selector', "a.pagination_item.next.lft"))
  try(wxbutton$clickElement()) # Click
  
  Sys.sleep(8)
  
  hlink = html(mybrowser$getPageSource()[[1]]) #Get the html source from site
  
  hlink %>% html_text() -> service_check
  
  #If there is a 503 error, go back
  if(grepl("503 Service", service_check)){ 
    
    mybrowser$goBack()
    
  }
  else
  {
    
    #Job Title
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjob_item") %>% html_nodes(".tpjob_title") %>% 
      html_text() %>% data.frame(stringsAsFactors = FALSE) -> x
    names(x) = "Title"
    a = rbind(a,x) #Add the new job postings to the ones extracted earlier
    
    #Recruitment Company
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjobwrap") %>% html_nodes(".tpjob_cname") %>% html_text() %>% 
      data.frame(stringsAsFactors = FALSE) -> y
    names(y) = "Company"
    b = rbind(b,y)
    
    #Links
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjob_item") %>% html_nodes(".tpjob_lnk") %>% 
      html_attr("href") %>% data.frame(stringsAsFactors = FALSE) -> z
    names(z) = "Links"
    c = rbind(c,z)
    
  }
  
}

beep()

#Put everything together in one dataframe
compile = cbind(a,b,c)

#export a copy, for backup
write.csv(compile, "Backup.csv", row.names = FALSE)

#close server and browser
mybrowser$close()
mybrowser$closeServer()

Now that I have all the links to the posts, I can loop through the previously compiled dataframe and get all the details from all the URLs.


#Make another copy to loop through
compile_2 = compile

#Create 8 new columns to represent the details to be extracted
compile_2$Location = NA
compile_2$Experience = NA
compile_2$Education = NA
compile_2$Stream = NA
compile_2$Function = NA
compile_2$Role = NA
compile_2$Industry = NA
compile_2$Posted_On = NA

#3 loops, 2 in 1
#First loop to go through the links extracted
for(i in 1:nrow(compile_2)){
  
  hlink = ""
  
  link = compile_2$Links[i]
  
  try(hlink <- html(link))
  
  if(html_text(hlink) != ""){
  
        hlink %>% html_nodes(".jd_infoh") %>% 
          html_text() %>% data.frame(stringsAsFactors = FALSE) -> a_column
        
        hlink %>% html_nodes(".jd_infotxt") %>% 
          html_text() %>% data.frame(stringsAsFactors = FALSE) -> l_column
   
  if(nrow(a_column) != 0){      
             
        #Second loop to check if the details are in the same order in each page
        for(j in nrow(l_column):1){
          
          if(nchar(str_trim(l_column[j,1])) == 0){l_column[-j,] %>% data.frame(stringsAsFactors = FALSE) -> l_column}
          
        }
         
    if(nrow(a_column) == nrow(l_column)){
    
        cbind(a_column, l_column) -> comp_column
        
        #Third loop to update dataframe with all the details from each post
        for(k in 1:nrow(comp_column)){
          
          if(grepl("Location", comp_column[k,1])){compile_2$Location[i] = comp_column[k,2]} 
          
          if(grepl("Experience", comp_column[k,1])){compile_2$Experience[i] = comp_column[k,2]}
          
          if(grepl("Education", comp_column[k,1])){compile_2$Education[i] = comp_column[k,2]}
          
          if(grepl("Stream", comp_column[k,1])){compile_2$Stream[i] = comp_column[k,2]}
          
          if(grepl("Function", comp_column[k,1])){compile_2$Function[i] = comp_column[k,2]}
          
          if(grepl("Role", comp_column[k,1])){compile_2$Role[i] = comp_column[k,2]}
          
          if(grepl("Industry", comp_column[k,1])){compile_2$Industry[i] = comp_column[k,2]}
          
          if(grepl("Posted", comp_column[k,1])){compile_2$Posted_On[i] = comp_column[k,2]}
        }
  
  }
  }
  }
}

beep()

#Export a copy for backup
write.csv(compile_2, "Raw_Complete.csv", row.names = FALSE)

#Alert
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.3)
beep(sound = 8) #That one's just me goofing around

Alright, we now have a nice dataframe of 1840 jobs and 11 columns showing:

1. Job Title
2. Company: The hiring company.
3. Links: The URL of the job posting.
4. Location: Where the job is situated.
5. Experience: Level of experience required for the job, shown as a range (e.g. 2-3 years)
6. Education: Minimum educational qualification.
7. Stream: Work stream category.
8. Function: Job function category
9. Role: The job’s general role.
10. Industry: Which industry the hiring company is involved in.
11. Posted_On: The day the job was originally posted.

As a matter of convenience, I decided to split the 5th column, Experience, into two other columns:

12. Min: Minimum years of experience required.
13. Max: Maximum years of experience.

The code used to mutate this Experience column was:

com_clean = compile_2

#logical vector of all the observations with no details extracted because of an error
is.na(com_clean[,4]) -> log_vec


#Place NA rows in a separate dataframe
com_clean_NA = com_clean[log_vec,]

#Place the remaining rows in another dataframe
com_clean_OK = com_clean[!log_vec,]


com_clean_OK[,"Experience"] -> Exp

#Remove whitespace and the "years" part
str_replace_all(Exp, " ", "") %>% 
  str_replace_all(pattern = "years", replacement = "") -> Exp

#Assign the location of the hyphen to a list
str_locate_all(Exp[], "-") -> hyphens

#Assign empty vectors to be populated with a loop
Min = c()
Max = c()

for(i in 1:length(Exp)){
  
  substr(Exp[i], 0, hyphens[[i]][1,1] - 1) %>% 
    as.integer() -> Min[i]
  
  substr(Exp[i], hyphens[[i]][1,1] + 1, nchar(Exp[i])) %>% 
    as.integer() -> Max[i]
  
}

#Assign results to new columns
com_clean_OK$Min_Experience = Min
com_clean_OK$Max_Experience = Max

#Rearrange the columns
select(com_clean_OK, 1:4, 12:13, 5:11) -> com_clean_OK

write.csv(com_clean_OK, "Complete_No_NA.csv", row.names = FALSE)
write.csv(com_clean_NA, "Complete_All_NA.csv", row.names = FALSE)

And with that, I have a nice dataframe of all the information I need to go through the posts. I was flirting with the idea of even trying to compile some code that would automatically apply for a job if it meets certain criteria, e.g. if the job title equals X, minimum experience is less than Y, and the location is in a list Z – then click this, and so on. Obviously, there is the question of how to get through the Captcha walls, as a colleague had once highlighted. In any case, I thought I should leave this idea for a different post. Till then, I’ll be involved in some intense googling to see if someone else has actually tried it out (using R, or even Python) and maybe pick up a few things.


Manhattan Rental Rates | Plotting Rental and Spatial Data

I came across this post written by a James Cheshire that introduces people to the topic of spatial data using R. I don’t know a single thing about using spatial data, so I thought I should give it a shot.

I kept looking for local Malaysian sites with user posts of any kind that might include coordinates, so that I might start plotting the information. I couldn’t find anything, so I settled on a random site as a sample. I couldn’t just plot data from every city in the US, so I settled for New York – Manhattan, to be specific.

As always, I inspected the HTML code on the website, and it was kind of nice to see that the code was a whole lot more structured than I had thought it would be. The information that is put up with each post is much more comprehensive than what you will normally see on Malaysian websites. For instance, the post details whether the tenant is allowed to have pets, or small pets, or cats only; whether the laundry is “in-unit” or in a laundry room; whether there is a doorman and if there is some outdoor space; etc. The photos posted are also very clear and thorough, and more importantly…there is a map showing where the unit is located. The map is dynamic and interactive, i.e. it switches the location automatically according to which post you mouse over.

After finding out how to get the coordinate data from the posts, I came up with the code below, which scrapes select information from each post:

library(dplyr)
library(rvest)
library(stringr)
 
link = "sample website URL here"
pg = 1
link = paste(link, as.character(pg), sep = "")
htLink = html(link)
 
#Last page on search results
htLink %>% html_nodes("#paginationBottom") %>% html_text() -> x
 
str_replace_all(x, "\n", "") -> x
str_replace_all(x, " ", "") -> x
 
str_locate_all(pattern = "...", x[1]) -> l
 
#***NOTE: Might not always work depending on how many search results there are***
l[[1]][3,1] %>% as.integer() -> m
l[[1]][3,2] %>% as.integer() -> n
substr(x, m+1,nchar(x)-1) %>% as.integer() -> limit
#***__________________________________________________________________________***
 
df = data.frame()
 
#Link to post
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-title") %>% html_attr("href") -> a
 
y <- seq(1,length(a), by=2)
a[y] -> a
df[1:length(a),1] = a
 
#Rental rate
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-title") %>% html_text() -> b
 
y <- seq(1,length(b), by=2)
b[y] -> b
df[1:length(b),2] = b
 
#Location
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-address") %>% html_text() -> c
 
y <- seq(1,length(c), by=2)
c[y] -> c
df[1:length(c),3] = c
 
#Layout
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-size") %>% html_text() -> d
 
y <- seq(1,length(d), by=2)
d[y] -> d
df[1:length(d),4] = d
 
#Amenities
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".amenities") %>% html_text() -> e
 
str_replace_all(e, "\n", "") -> e
str_replace_all(e, "  ", "") -> e
df[1:length(e),5] = e
 
#COORDINATES
#Latitude
htLink %>% html_nodes("div") %>% html_attr("data-latitude") -> f
f[!is.na(f)] -> f
df[1:length(f),6] = f
#Longitude
htLink %>% html_nodes("div") %>% html_attr("data-longitude") -> g
g[!is.na(g)] -> g
df[1:length(g),7] = g
 
#________LOOP STARTS HERE____________________________________________
 
df2 = data.frame()
 
for(i in 2:limit){
   
link = "sample website URL link here"
pg = i
link = paste(link, as.character(pg), sep = "")
htLink = html(link)
 
#Link to post
htLink %>% html_nodes(".listing-results") %>% 
html_nodes(".listing-title") %>% html_attr("href") -> a
 
y <- seq(1,length(a), by=2)
a[y] -> a
df2[1:length(a),1] = a
 
 
#Rental rate
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-title") %>% html_text() -> b
 
y <- seq(1,length(b), by=2)
b[y] -> b
df2[1:length(b),2] = b
 
#Location
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-address") %>% html_text() -> c
 
y <- seq(1,length(c), by=2)
c[y] -> c
df2[1:length(c),3] = c
 
#Layout
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".listing-size") %>% html_text() -> d
 
y <- seq(1,length(d), by=2)
d[y] -> d
df2[1:length(d),4] = d
 
#Amenities
htLink %>% html_nodes(".listing-results") %>% 
  html_nodes(".amenities") %>% html_text() -> e
 
str_replace_all(e, "\n", "") -> e
str_replace_all(e, "  ", "") -> e
df2[1:length(e),5] = e
 
#COORDINATES
#Latitude
htLink %>% html_nodes("div") %>% html_attr("data-latitude") -> f
f[!is.na(f)] -> f
df2[1:length(f),6] = f
#Longitude
htLink %>% html_nodes("div") %>% html_attr("data-longitude") -> g
g[!is.na(g)] -> g
df2[1:length(g),7] = g
 
rbind(df, df2) -> df
 
df2 = data.frame()
 
}
 
headers = c("Links", "Rental", "Location", "Layout", "Amenities", "Latitude", "Longitude")
names(df) = headers
 
write.csv(df, "Compiled_Ext.csv", row.names = FALSE)

The code didn’t really take that long to scrape the information. The dataframe that the code spits out has 39,223 rows and 7 columns. The 7 columns are:
1. The link to the post (Link)
2. The rental, as displayed in the website (Rental)
3. The address (Location)
4. The layout of the house, i.e. number of rooms and bathrooms (Layout)
5. List of amenities (Amenities)
6. Latitude (Latitude)
7. Longitude (Longitude)

Obviously the raw data would need to be cleaned a little bit further; however, I was sort of fixated on the Amenities column, mainly because it appears that this column can be broken down into several other columns depending on whether a particular amenity is available or not.

That said, I made the following changes to the original extraction:
1. Changed the Rental column to class numeric by removing the comma and dollar sign and subsequently coercing all the figures to numeric. The new column is called Rental.Num.
2. The Layout column was then mutated into two separate columns: Bed and Bath. The value in each row of these new columns was coerced to class integer.
3. After looking at it a little more closely, it would appear that the Amenities column can be mutated into 13 new columns:

Pets: The type of pets that are allowed. The values are All, Cats, Approval, or Small Pets
Elevator: If there is an elevator in the building.
Gym: The availability of a gym in the building.
Fireplace: If the apartment comes with a fireplace.
Laundry: The availability of any laundry facilities. The values are either “LRoom”, which stands for laundry room; or “In-Unit”, which means the laundry room is inside the apartment.
Pre.Post.War: If the apartment was built pre or post WW-II. The values are either “Pre” or “Post”.
Doorman: If there is a doorman in the building.
Square.Feet: This is a measure of the size of the apartment, in square feet. A lot of the people who posted on the website did not state the size of the apartment outright. The values were extracted from the Amenities column and then coerced into class numeric.
Parking: I’m not entirely certain if this means that each apartment comes with a parking spot, or if there are parking lots available outside the building. The default values are the same.
Furnished: If the apartment is furnished or not.
Outdoor.Space: I did not understand the significance of this amenity either. It also has the same default values.

The code used to do this is below:

library(stringr)
library(dplyr)
 
df -> df2
 
nHeaders = c("Bed", "Bath", "Pets", "Elevator", "Gym", "Dishwasher", 
             "Fireplace", "Laundry", "Pre.Post.War", "Doorman", 
             "Square.Feet", "Parking", "Furnished", "Outdoor.Space")
 
for(i in 1:length(nHeaders)){
   
  df2[,ncol(df2)+1] = NA
   
}
 
names(df) -> oHeaders
Headers = c(oHeaders, nHeaders)
names(df2) = Headers
 
#Layout: Bed always comes before bath
bedBath <- df2[, "Layout"]
 
for(i in 1:length(bedBath)){
   
  gregexpr(",", bedBath[i])[[1]] -> loc
   
  bedBath[i] <- substr(bedBath[i], 0, loc-1)
   
}
 
str_replace_all(bedBath, "BR", "") -> bedBath
 
bedBath <- as.numeric(bedBath)
 
df2[,"Bed"] <- bedBath
 
bedBath <- df2[, "Layout"]
 
#Exact same code above, but for bath
for(i in 1:length(bedBath)){
   
  gregexpr(",", bedBath[i])[[1]] -> loc
   
  bedBath[i] <- substr(bedBath[i], loc+1, nchar(bedBath[i]))
   
}
 
str_replace_all(bedBath, "BA", "") -> bedBath
 
bedBath <- as.numeric(bedBath)
 
df2[,"Bath"] <- bedBath
 
#______
 
#Elevator, Gym, Dishwasher, Fireplace, Doorman, Parking, Furnished, OutdoorSpace
 
for(i in 1:nrow(df2)){
   
  if(grepl("elevator", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Elevator"] <- "Y"}else{df2[i,"Elevator"] <- "N"}
   
  if(grepl("gym", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Gym"] <- "Y"}else{df2[i,"Gym"] <- "N"}
   
  if(grepl("fireplace", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Fireplace"] <- "Y"}else{df2[i,"Fireplace"] <- "N"}
   
  if(grepl("doorman", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Doorman"] <- "Y"}else{df2[i,"Doorman"] <- "N"}
   
  if(grepl("parking", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Parking"] <- "Y"}else{df2[i,"Parking"] <- "N"}
   
  if(grepl("furnished", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Furnished"] <- "Y"}else{df2[i,"Furnished"] <- "N"}
   
  if(grepl("outdoor space", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Outdoor.Space"] <- "Y"}else{df2[i,"Outdoor.Space"] <- "N"}
   
}
 
#________
 
#Laundry, Post-Pre War, Pets
 
for(i in 1:nrow(df2)){
   
  if(grepl("laundry room", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Laundry"] <- "LRoom"}else{
     
    if(grepl("in-unit laundry", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Laundry"] <- "In-Unit"}else{
       
      if(grepl("laundry", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Laundry"] <- "Y"}
       
    }
     
  }
   
  if(grepl("post war", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Pre.Post.War"] <- "Post"}else{
     
    if(grepl("pre war", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Pre.Post.War"] <- "Pre"}
     
  }
   
  if(grepl("pets allowed", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Pets"] <- "All"}else{
     
    if(grepl("cats only", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Pets"] <- "Cats"}else{
       
      if(grepl("small dogs", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Pets"] <- "Small"}else
         
        if(grepl("approved pets", df2[i,"Amenities"], ignore.case = TRUE)){df2[i,"Pets"] <- "Approval"}
       
    }
     
  }
   
  #SquareFeet
  if(grepl("sq. ft.", df2[i,"Amenities"], ignore.case = TRUE)){
     
    gregexpr("Sq. Ft.", df2[i,"Amenities"], ignore.case = FALSE)[[1]][1] - 2 -> loc
     
    substr(df2[i,"Amenities"], 0, loc) %>% 
      as.numeric() -> df2[i, "Square.Feet"]
  }
   
   
}
 
#Add new variable for rentals
rentals <- df2[,"Rental"]
str_replace_all(rentals, "\\$", "") %>% 
  str_replace_all(pattern = ",", replacement = "") %>% 
  as.numeric() -> rentals
 
df2[,"Rental.Num"] <- rentals
 
#Backup file
write.csv(df2, "Compiled_N_Variables.csv", row.names = FALSE)
 
Final_df <- select(df2, Location, 6:22)

So the final compiled dataframe has 39,223 rows and 22 variables. Since I won’t be using quite a few of these columns, I took a subset of 18 variables from the compiled dataframe. Since the main reason I needed the coordinate information was to try out plotting data from shapefiles, I’ll only plot a couple of graphs that depict the number of posts per layout.

library(ggplot2) #load ggplot2 for the plots below

Final_df2 <- Final_df
 
Final_df2 <- filter(Final_df2, Bed != "NA")
 
Final_df2$Bed <- as.factor(Final_df2$Bed)
Final_df2$Bath <- as.factor(Final_df2$Bath)
 
Final_df2 %>% group_by(Bed, Bath) %>% summarise(Count = length(Bed)) -> Bed_t
 
x = seq(0, 20000, by = 1000)
 
ggplot(Bed_t, aes(x = reorder(Bed, -Count), y = Count, fill = Bath)) + 
  geom_bar(stat = "identity") + scale_y_continuous(breaks = x) + 
  ggtitle("Number of posts by Bed and Bath") + xlab("Bed") + ylab("Number of posts") + 
  theme(axis.text.y=element_text(face="bold", size=13, color="black"), 
        axis.text.x=element_text(face="bold", size=13, color="black"), 
        plot.title=element_text(face="bold", size=16, color="black"))

No. of posts


Final_df2 <- Final_df
 
summary(select(Final_df2, Bed, Bath, Rental.Num))
 
Final_df2 = filter(Final_df2, Bed != "NA", Rental.Num <= 30000)
 
x = seq(0, 32000, by = 2000)
 
ggplot(Final_df2, aes(x = as.factor(Bed), y = Rental.Num)) + 
  geom_boxplot(color = "navy", outlier.colour = NA) + scale_y_continuous(breaks = x) + 
  theme(axis.text.y=element_text(face = "bold", color = "black", size = 13), 
        axis.text.x=element_text(face = "bold", color = "black", size = 13), 
        plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  xlab("Bedrooms") + ylab("Rental") + ggtitle("Boxplot: Rental VS No. of Bedrooms") + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 2, fill = "red")

Rental VS Room

It appears that there are plenty of apartments out there for people looking for a 1 bedroom or a 2 bedroom layout. Since these are apartments in Manhattan, I have no idea which neighborhoods are considered fancy or what rental rate is considered excessive, so I can't really give much insight on the second graph. As a sidenote, I turned off the outlier dots in the boxplot. To show the outliers, the part that says outlier.colour = NA would need to be removed.

Moving on to plotting the coordinates. I initially thought that using the OpenStreetMap package would be helpful here, but I just didn't know how to change the coordinates on the map to latitude/longitude. In my ignorance, I hadn't realized that there are quite a number of different coordinate reference systems out there.

I finally settled on just using a shapefile that was available for download on the NYC Planning website. Using ggplot2's fortify() function, the shapefile can be converted into a dataframe that ggplot2 understands.
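The code for that step isn't shown here, but a minimal sketch would look something like the following. The folder and layer names are placeholders for whatever was downloaded from the NYC Planning site, and rgdal is just one of several packages that can read a shapefile:

library(rgdal)    #readOGR() reads the shapefile; loading rgdal also attaches the sp package
library(ggplot2)  #fortify() lives here

#read the downloaded shapefile: dsn is the folder, layer is the file name without the extension
shape <- readOGR(dsn = "nyc_shapefile", layer = "nybb")

#reproject to plain longitude/latitude so it lines up with the scraped coordinates
shape <- spTransform(shape, CRS("+proj=longlat +datum=WGS84"))

#convert the shapefile into a dataframe that ggplot2 understands
map <- fortify(shape)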

I started out by just plotting the data from the shapefile, and what I got was this:

Final_df2 <- Final_df
 
Final_df2 <- filter(Final_df2, Bed != "NA")
Final_df2 <- filter(Final_df2, Latitude != 0)
 
ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) +
  geom_polygon(data = map, aes(x = long, y = lat, group = group)) 

Manhattan map

Needless to say, not exactly the most aesthetically pleasing plot I've done. Before I fix the distortion in the map, I need to figure out how to plot contours in ggplot2. Nothing a little googling can't fix. This is the code that I used:

Final_df2 <- Final_df
 
Final_df2 <- filter(Final_df2, Bed != "NA")
Final_df2 <- filter(Final_df2, Latitude != 0)
 
ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) + 
  stat_density2d()

Contour only

If you're thinking what I'm thinking, you'll notice that the contour plot has a shape that looks a little similar to the Manhattan plot from earlier. The goal is to figure out how to put both plots on the same graph, and then adjust for any colors and distortions.

ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) + 
  geom_polygon(data = map, aes(x = long, y = lat, group = group)) + stat_density2d() 

Map plus contour

So far so good. Now I'm going to remove the grid background, the axis ticks, and the labels:

ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) + 
  geom_polygon(data = map, aes(x = long, y = lat, group = group)) + stat_density2d() + 
  theme(axis.text=element_blank(), panel.grid=element_blank(), axis.title=element_blank(), 
        axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), legend.position="none", 
        panel.background=element_blank())

Map plus contour no background

Next, mapping the alpha, fill, and color of the contours to the density level; setting the fill color of the map and the background; and adding the coord_equal() function:

Final_df2 <- Final_df
 
Final_df2 <- filter(Final_df2, Latitude > 0)
 
Final_df2 <- filter(Final_df2, Bed != "NA")
 
ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) + 
   
  geom_polygon(data = map, aes(x = long, y = lat, group = group), fill="#C7CBD0") + 
   
  stat_density2d(aes(alpha=..level.., fill=..level.., color=..level..), geom="polygon") + 
   
  theme(axis.text=element_blank(), panel.background=element_rect(fill = "#3C3E3F"),
        panel.grid=element_blank(), axis.title=element_blank(), 
        axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), 
        legend.position="none") + coord_equal() 

Map - without heat

And finally, just changing the colors of the contours to the traditional green-amber-red heat map:

Final_df2 <- Final_df
 
Final_df2 <- filter(Final_df2, Latitude > 0)
 
Final_df2 <- filter(Final_df2, Bed != "NA")
 
 
ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) + 
   
  geom_polygon(data = map, aes(x = long, y = lat, group = group), fill="#C7CBD0") + 
   
  stat_density2d(aes(alpha=..level.., fill=..level.., color=..level..), geom="polygon") + 
   
  theme(axis.text=element_blank(), panel.background=element_rect(fill = "#3C3E3F"),
        panel.grid=element_blank(), axis.title=element_blank(), 
        axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), 
        legend.position="none") + coord_equal() + 
   
  scale_fill_gradient(low = "#3BE819", high = "#B5170B") + 
  scale_color_gradient(low = "#3BE819", high = "#B5170B")

Map with heat

Again, I wouldn't know much about neighborhoods in NYC, but judging from a quick check in Google Maps, the locations with the most posts are in Murray Hill and/or Midtown East, and the Upper West Side. I'm kind of interested to see what the result would be if I added a facet_wrap() to the code:

Final_df2 <- Final_df
 
Final_df2 <- filter(Final_df2, Latitude > 0)
 
Final_df2 <- filter(Final_df2, Bed != "NA")
 
 
ggplot(data = Final_df2, aes(x = Longitude, y = Latitude)) + 
   
  geom_polygon(data = map, aes(x = long, y = lat, group = group), fill="#C7CBD0") + 
   
  stat_density2d(aes(alpha=..level.., fill=..level.., color=..level..), geom="polygon") + 
   
  theme(axis.text=element_blank(), panel.background=element_rect(fill = "#3C3E3F"),
        panel.grid=element_blank(), axis.title=element_blank(), 
        axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), 
        legend.position="none") + coord_equal() + 
   
  scale_fill_gradient(low = "#3BE819", high = "#B5170B") + 
  scale_color_gradient(low = "#3BE819", high = "#B5170B") + facet_wrap(~Bed)

Map with facet_wrap

Not as appealing as I thought it would be. Perhaps if I combined the 2.5 bedroom category into the 3 bedroom, and the 3.5 into the 4 bedroom, it would look a little better. It would've also been nice to plot the same map with the colors mapped to the rental rate. That would let me filter a particular apartment layout, e.g. 1 Bed 1 Bath, and see which neighborhoods have the highest rental rates. But I ran into some trouble plotting that, so for now this'll have to do.
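Collapsing those bedroom categories, at least, would be a small change before the facet_wrap() call; a rough sketch, assuming Bed is stored as a plain number (if it's a factor at that point, it would need as.numeric(as.character()) first):

#fold the half-bedroom categories into the next whole number
Final_df2$Bed <- ifelse(Final_df2$Bed == 2.5, 3,
                 ifelse(Final_df2$Bed == 3.5, 4, Final_df2$Bed))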

If you're interested to see how the full output actually looks, let me know and I can help out.


Plotting Twitter Data

I've only just come to realize that some social media sites allow you to download your entire historical activity in one large data file. I don't know about other sites, but I'm aware that Twitter gives you the option to download a data file showing all your tweets, retweets, when you tweeted them, who you retweeted, and who you've replied to. All of this also includes any links you might have included in the tweets, along with all the date/timestamps.

I went ahead and downloaded my data using the instructions here and started messing around with it to see on which days I normally tweet the most and how my Twitter activity looks overall since I started working.

I named the CSV file tweets, loaded it in RStudio, and started fooling around.
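The loading step itself is just one line; a sketch, assuming the export was saved as tweets.csv in the working directory:

#read the downloaded Twitter archive
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)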

library(dplyr)
library(stringr)
library(ggplot2)

times = tweets[,"timestamp"]
stringr::str_replace_all(times, " \\+0000", "") -> times

as.POSIXct(times) %>% 
  weekdays() %>% table() %>% 
  data.frame() -> days

names(days) = c("Days", "Count")
days = days[c(2,6,7,5,1,3,4),]

days %>% ggplot(aes(Days, Count, fill = Count)) + geom_bar(stat = "identity") + 
  scale_x_discrete(limits = c("Monday", "Tuesday", "Wednesday", 
                              "Thursday", "Friday", "Saturday", "Sunday")) + 
  ggtitle("Total Number of Tweets on Each Day of the Week") + 
  theme(plot.title=element_text(face="bold")) + guides(fill=FALSE)

Total number of tweets on each day of the week

Judging from the graph above, which shows how many tweets I've sent on each day of the week since early 2011, it's safe to say that I'm not a big fan of tweeting on Mondays, but I do get pretty hyped up on weekends, since Saturday and Sunday unsurprisingly show the highest number of tweets.

But I would like to see a graph that's a little more comprehensive. I'm thinking something along the lines of the number of tweets for each day, each month, since the beginning of 2011 – which was when I opened an account on Twitter.

as.POSIXct(times) %>% 
  months() -> x

as.POSIXct(times) %>% 
  weekdays() -> y

substr(times, 0, 4) -> z

tDates = data.frame(x,y,z)

names(tDates) = c("Month", "Day", "Year")

s_tDates = summarise(group_by(tDates,Day, Month, Year), Count = length(Year))

s_tDates$Month = factor(s_tDates$Month, levels = c("January", "February", "March", "April", 
                                             "May", "June", "July", "August", "September", 
                                             "October", "November", "December"))

o_days = c("Monday", "Tuesday", "Wednesday", 
           "Thursday", "Friday", "Saturday", "Sunday")

ggplot(s_tDates, aes(Day, Count, fill = Year)) + geom_bar(stat = "identity") + 
  facet_wrap(~Month) + scale_x_discrete(limits = o_days) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) + 
  scale_fill_brewer(palette = "Set2") + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black"))

Number of tweets per day of the week, by month and year

To make some sense of this graph, it's worth noting that I got my first job in May of 2012. This would perhaps explain why from January to April the bars are covered in orange (2012), but then the orange almost disappears from sight starting in May. That's because I was mainly slaving away at this new job, and it didn't allow much time for anything else.

The graph also seems to show that I've dialed back on the tweets since getting a full-time job. A graph showing the overall trend in tweets should help confirm this.

summarise(group_by(tDates, Month, Year), Count = length(Year)) -> s_tDates

o_months = c("January", "February", "March", "April", "May", "June", 
             "July", "August", "September", "October", "November", "December")

ggplot(s_tDates, aes(Month, Count, fill = Count)) + geom_bar(stat = "identity") + 
  facet_wrap(~Year, nrow = 1) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_x_discrete(limits = o_months) + scale_fill_gradient(low="Black", high="Blue") + 
  guides(fill=FALSE) + theme(axis.text.y=element_text(face = "bold", color = "black"), 
                               axis.text.x=element_text(face = "bold", color = "black"))

Number of tweets per month, by year

Yup. That pretty much settles any doubts. The huge dip in tweet activity in May 2012 indicates that my Twitter account has been gathering an awful amount of dust since then.

With this new information, I'd also like to know at what hour I normally tweet these days versus at what hour I used to tweet before I got my first job.

#keep only the time-of-day portion of each timestamp, then just the hour
substring(times, 12, nchar(times[1])) -> hours
as.integer(substring(hours, 0,2)) -> hours

#the stripped timestamps are in UTC, so shift them forward by 8 hours (presumably to local time), wrapping past midnight
adj_hours = hours + 8

for(i in 1:length(adj_hours)){
  
  if(adj_hours[i] > 23){adj_hours[i] = adj_hours[i] - 24}
  
}

hours_df = data.frame(hours, z)
names(hours_df) = c("Hour", "Year")

summarise(group_by(hours_df, Hour, Year), Count = length(Year)) -> hours_df

Breaks = seq(1,23, by=1)

ggplot(hours_df, aes(Hour, Count, color = Year)) + geom_line(size = 1.1) + 
  scale_x_discrete(limits = Breaks) + geom_point(size=2, color = "black") + 
  ggtitle("Hour In Which Tweets Were Sent") + theme(plot.title=element_text(face="bold")) + 
  scale_y_continuous(breaks=seq(0,200,10)) + theme(axis.text.y=element_text(face = "bold", color = "black"), 
                                                   axis.text.x=element_text(face = "bold", color = "black"))

Hour in which tweets were sent, by year

In 2011, it would seem that I usually tweeted an awful lot from 11AM to 1PM. The same spike that appears in 2012 must be mainly from the first four months, because I started work in May. Every subsequent year doesn't seem to show any pattern in what time I normally tweet.

I guess there are more ways in which a person's Twitter data file can be analyzed. In my case, it's very much possible to plot which users I retweet the most, which I reply to the most, which sources I link to, and so on. But since I keep a pretty uninteresting Twitter account, I don't think that sort of information would be of much value. You can definitely try it out on your own data, though, and it might just give you an insight into your social media activity.


Exploring New and Used Car Data in Malaysia

I came across a local website where individuals and dealers in Malaysia can post information on used and new cars that they are selling. Why I, in particular, would browse such a site might dumbfound some people (those who personally know me would know what I'm talking about), but I nevertheless found myself spending over an hour going through the posts put up by hundreds of users in Malaysia; and it got me wondering. It would be pretty interesting to explore these posts a little further by extracting the information from the site.

The website that I was referring to is carlist.my, which they claim is:

…Malaysia’s No.1 car site connecting car buyers, sellers and enthusiasts to a single platform which encompasses car classifieds and informative content.

Similar to the property data scraping post, I figured it would be best to limit the scraping to a particular categorical variable; in this case, a specific car manufacturer. I chose Perodua since I've noticed that most Malaysians consider it the more popular half of the two manufacturers that dominate the local market.

Alright, now that the manufacturer has been chosen, it's time to study the site a little bit so that the code can accommodate the site's idiosyncrasies.

From the site's search results page, I can see that there are 12 search results per page. So going through 250 search result pages should give me the links to 3,000 car sale posts, and navigating to those links should provide the information I need for each sale.

Seeing how I want to retrieve data for both used and new car sales, I decided to run the code twice: once for the new cars URL and once for the used cars URL.

In contrast to what I did in the property data post, I decided to place the loops in a function and then assign the result of that function to a variable. The PC I was using had an Intel i3-4150 processor with 4GB of RAM, on a 5Mbps WiFi connection. Needless to say, it took an eternity, but that didn't bother me. What did bother me was that it also ended with the computer crashing because of all the memory it takes to hold that information. I tried again on an i5-5200 computer with 8GB of RAM, the same internet connection but this time over an ethernet cable (helplessly hoping that this would make a difference). It still took eons to complete, but at least it didn't crash.

The code I used is below. As a note, the site_first variable is where the URL for the search results page needs to be pasted, minus the page number.


#load libraries
library(stringr)
library(rvest)
library(ggvis)
library(dplyr)
library(ggplot2)

#The site
Site = "http://www.carlist.my/"

#The first half of the URL..
#Here you can paste the URL of the search results; used or new 
site_first = "http://www.carlist.my/used-cars/perodua?page_number=" 
  
#concatenate them together, with the coerced digit in between them. This digit is the page number
siteCom = paste(site_first, as.character(1), sep = "")
siteLocHTML = html(siteCom)

#Assign the links of the first page to x and y
siteLocHTML %>% html_nodes(".js-vr38dett-title") %>% 
  html_text() %>% data.frame() -> x

siteLocHTML %>% html_nodes(".js-vr38dett-title") %>% 
  html_attr("href") %>% data.frame() -> y

#Create function that will loop through the second page and onwards
Links = function(limit){
  
  for(i in 2:limit){
    
    siteCom = paste(site_first, as.character(i), sep = "")
    siteLocHTML = html(siteCom)
    
    siteLocHTML %>% html_nodes(".js-vr38dett-title") %>% 
      html_text() %>% data.frame() -> x_next
    
    siteLocHTML %>% html_nodes(".js-vr38dett-title") %>% 
      html_attr("href") %>% data.frame() -> y_next
    
    x = rbind(x, x_next)
    y = rbind(y, y_next)
    
  }

z = cbind(x,y)
return(z)
}

complete = Links(250) #The 250 represents the search pages
rm(x,y)

for(i in 3:12){
  complete[,i] = NA
}

headers = c("Desc", "Link", "Make", "Model", "Year", "Engine.Cap", "Transm", "Mileage", "Color", "Car.Type", "Updated", "Price")

names(complete) = headers

write.csv(complete, "Links_CarList_Used.csv", row.names = FALSE) #export the links, for backup

Details = function(dataframe, i){

  for(j in 1:nrow(dataframe)){  
  
        link = html(paste(Site, dataframe[j, i], sep = ""))
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-make") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
          dataframe[j,"Make"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-model") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Model"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-year") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Year"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-engine") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Engine.Cap"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-transmission") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Transm"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-mileage") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, "  ", "")
        
        dataframe[j,"Mileage"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-color") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Color"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-car-type") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Car.Type"] = a
        
        link %>% html_nodes("#single-post-detail") %>% 
          html_nodes(".section-body") %>% 
          html_nodes(".row-fluid") %>% 
          html_nodes(".span6") %>% 
          html_nodes(".list") %>% 
          html_nodes(".tr-updated") %>% 
          html_nodes(".data") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        #a = str_replace_all(a, " ", "")
        
        dataframe[j,"Updated"] = a
        

        link %>% html_nodes("#single-post-header") %>% 
          html_nodes(".post-highlight") %>% 
          html_nodes(".price") %>% 
          html_text() -> a
        
        a = str_replace_all(a, "\n", "")
        a = str_replace_all(a, " ", "")
        
        dataframe[j,"Price"] = a
        
}

return(dataframe)

}

Final = Details(complete, 2)
write.csv(Final, "Final_CarList_Used.csv", row.names = FALSE)

The information that was extracted for each post is listed below, with the column name in parentheses:
1. Description of the post (Desc)
2. Link to the post (Link)
3. Manufacturer name (Make)
4. Car model (Model)
5. Year of the model (Year)
6. Engine’s capacity (Engine.Cap)
7. Transmission (Transm)
8. Mileage (Mileage)
9. Color (Color)
10. New or Used (Car.Type)
11. Date post was updated (Updated)
12. Price of the car (Price)

And that brings us to the most tedious part of the whole activity: cleaning the data and re-classification.

As it stands, all the variables are of class character, but the way I see it, only the first two should remain that way. Price and Mileage should be converted to integers, while Model, Year, Transm, Car.Type, Color and Engine.Cap should be factors; and Updated should become a POSIXct date.
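The conversions themselves are one-liners; a rough sketch of what I mean, using compiled_fac, the name the cleaned dataframe goes by further down (the date format for Updated is a guess and would need to match whatever the site actually prints):

#convert the categorical columns to factors
for(col in c("Model", "Year", "Transm", "Car.Type", "Color", "Engine.Cap")){
  compiled_fac[, col] = as.factor(compiled_fac[, col])
}

#strip everything that isn't a digit from Price, then coerce
compiled_fac[, "Price"] = as.integer(str_replace_all(compiled_fac[, "Price"], "[^0-9]", ""))

#posting date; the format string here is an assumption
compiled_fac[, "Updated"] = as.POSIXct(compiled_fac[, "Updated"], format = "%d/%m/%Y")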

Changing the class of each variable wasn't so much of an issue, except in the case of Mileage. On the site, mileages are either stated as, for instance, "95,000", or as a range, such as "95 - 105k km". The former wasn't much trouble, as it only involved removing the comma and then coercing the class to integer using as.integer(). The issue lay in the posts that contain ranges. In order to have a nice tidy format, I decided to average the mileages that were posted (whether this is acceptable, however, I'm not sure). This means isolating both ends of the range in separate columns and then using dplyr's mutate() function to create a new variable. The code is below:


miles = compiled_fac[, "Mileage"]

#Remove "km" from all observations
miles = str_replace_all(miles, "km", "")

#Isolate observations that have "k" in a separate dataframe
miles_df = data.frame(miles[grep("k", miles)])

#Add two columns to this new df
miles_df[,"Start"] = NA
miles_df[,"Stop"] = NA



for(i in 1: nrow(miles_df)){
  
  #To assign the location of the hyphen in string
  limit = as.integer(gregexpr(" - ", miles_df[i,1]))
  
  #Extract only lower end of the range and assign to Start column
  miles_df[i, "Start"] = substr(miles_df[i,1], 0, limit - 1)
  
  #Extract only higher end of the range and assign to Stop column
  miles_df[i, "Stop"] = substr(miles_df[i,1], 
                               as.integer(gregexpr(" - ", miles_df[i,1])) + 3, 
                               nchar(as.character(miles_df[i,1])))
  
}

#Remove the "k" from lower end
miles_df[, "Stop"] = str_replace_all(miles_df[,"Stop"], "k ", "")

#check if any NAs
anyNA(as.integer(miles_df[, "Stop"])) 
anyNA(as.integer(miles_df[, "Start"]))

#Prepare columns for calculation
miles_df[,"Start"] = as.integer(miles_df[,"Start"])
miles_df[,"Stop"] = as.integer(miles_df[,"Stop"])


dplyr::mutate(miles_df, avg_miles = ((Start + Stop)/2)*1000) -> miles_df_adj
new_miles = miles_df_adj$avg_miles

miles[grep("k", miles)] = new_miles

miles = as.integer(str_replace_all(miles, ",", ""))

compiled_fac[,"Mileage"] = miles

So, the gist of it is turning this…

Mileage column as scraped (values like "95 - 105k km")

…into this…

Mileage column after averaging (plain integer values)

With that, we now have the complete cleaned data set. Since the code was run once for used car data and a second time for new car data, we have a total of 6000 observations equally divided between new and used. The final dimensions of the finished dataframe are 6000 observations of 12 variables, two of which are the description of the post and the link.

And now we have reached the visual aspect of the exploration. I want to see which model appears to have the most posts in this sample of ours. I've never used a mosaic plot before, but I guess there is always a first time:

 
vcd::mosaic(~ Model + Car.Type, data = compiled_fac, 
            direction = c("v", "h"), highlighting="Car.Type", 
            highlighting_fill=c("light blue", "dark grey"), 
            spacing = vcd::spacing_equal(sp = grid::unit(2, "lines"))) #wider spacing so the labels overlap less

Mosaic plot: Model and Car.Type

I had to widen the spacing to 2 lines because the labels for the Kancil, Kelisa, Kembara, and Kenari were overlapping each other. I couldn't do much for the Nautica and Rusa. I'm not even sure if it's possible to tilt the labels at an angle, like in ggplot2.

In any case, unsurprisingly, the ever-so-prominent Myvi appears to be the most popular of Perodua's cars, with roughly as many used car posts as new ones. Following in popularity are the Viva and Alza, though there appear to be more new car posts for the Viva than used ones. There are only used car posts for the Kancil, Kelisa, Kembara, and Kenari. I would have been surprised if there were any new car posts for any of those, to be honest.

It seems that 2010 represents the biggest category of used cars as far as the model year is concerned, followed by 2011 and 2012. I'm guessing it's because people think they can get some kind of upgrade by ditching their current cars and getting new ones…Perodua or otherwise. The graph depicting this is below:

simple_sum = filter(compiled_fac, Car.Type == "UsedCar")
simple_sum = summarise(group_by(simple_sum, Year, Model), Count = length(Model))

ggplot(simple_sum, aes(reorder(Year,-Count), Count, fill = Model)) + 
  geom_bar(stat = "identity") +   #coord_flip() + 
  scale_fill_brewer(palette = "Paired") + 
  xlab("Model Year") + ggtitle("Number of sale posts, by Model and Year") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black", size = 12), 
        axis.text.x=element_text(face = "bold", color = "black", size = 12))

Number of sale posts, by Model and Year

I could go on about which groups represent what percentage of the data, but I really want to start putting up some boxplots of the prices. Before I could do that, though, I had to manipulate the Engine.Cap variable so that there are only 6 categories: 0.66, 0.85, 1.00, 1.3, 1.5, and 1.6. I haven't added the code for this manipulation, but it's nothing new. It's just a matter of assigning the Engine.Cap column to a vector, removing all the "cc" using str_replace_all() from the stringr package, changing all the 1500s to 1.5, 1295s to 1.3, and so on; and then assigning that vector back to the Engine.Cap column.
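Since the description above really is all there is to it, here's a rough sketch of that manipulation; the 1500 and 1295 mappings are the ones mentioned, and the remaining capacities follow the same pattern:

caps = compiled_fac[, "Engine.Cap"]

#remove the "cc" suffix
caps = str_replace_all(caps, "cc", "")

#recode the raw capacities into the six categories
caps[caps == "1500"] = "1.5"
caps[caps == "1295"] = "1.3"
#...and so on for the values that map to 0.66, 0.85, 1.00, and 1.6

#assign the vector back and re-factor the column
compiled_fac[, "Engine.Cap"] = as.factor(caps)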

Now, on to the boxplots. Let’s take a look at a simple one of the car models VS the prices.

#No Mileage for new cars, remove column
simple_sum = select(compiled_fac, -Mileage)

simple_sum = filter(simple_sum, Car.Type == "NewCar")

simple_sum = na.omit(simple_sum)

x = seq(20000, 70000, by = 2000)

ggplot(simple_sum, aes(reorder(Model,-Price), Price)) + geom_boxplot(fill = "light blue") + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 2, fill = "red") + 
  scale_y_continuous(breaks = x) + 
  xlab("Model") + ggtitle("Boxplot: Car Model VS Price") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black", size = 12))

Boxplot: Car Model VS Price

No big revelations here, but since the model is not the only factor that determines the price of a new car, I would like to see how the price differences look when I add the transmission type.


ggplot(simple_sum, aes(reorder(Model,-Price), Price)) + geom_boxplot(fill = "orange") + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 2, fill = "red") + 
  facet_grid(~Transm) + scale_y_continuous(breaks = x) + 
  xlab("Model") + ggtitle("Boxplot: Car Model VS Price, by Transmission Type") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black", size = 12))

Boxplot: Car Model VS Price, by Transmission Type

There seems to be a slight difference in prices for the Alza, Myvi, and Viva. The Axia, however, indicates a pretty big price difference if I were to judge solely on the transmission. But what about the engine capacity? I'm sure that must have some effect on how much the car would cost.


simple_sum = filter(simple_sum, Engine.Cap != 0.66, Engine.Cap != 0.85)

ggplot(simple_sum, aes(reorder(Model,Price), Price)) + geom_boxplot(fill = "light pink") + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 2, fill = "red") + 
  facet_grid(Transm~Engine.Cap) + scale_y_continuous(breaks = x) + 
  xlab("Model") + ggtitle("Boxplot: Car Model VS Price, by Transmission Type and Engine Capacity") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black", size = 12))

Boxplot: Car Model VS Price, by Transmission Type and Engine Capacity

The only model affected by the engine capacity variable appears to be the Myvi, since you can get it in the 1.3 or the 1.5, and there is a very apparent price difference between those two categories. However, selecting between transmissions shouldn't be too much of a concern, considering the relatively small price difference. The same can't be said for the Axia, because of the very clear shift between the averages when comparing prices according to transmission.

There is one more variable that is available in the dataset that would likely shed some more light on the price difference, and that is the model year.

 


ggplot(simple_sum, aes(reorder(Model,Price), Price, fill = Year)) + geom_boxplot() + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 2, fill = "red") + 
  facet_grid(Transm~Engine.Cap) + scale_y_continuous(breaks = x) + 
  xlab("Model") + ggtitle("Boxplot: Car Model VS Price, by Transmission Type, Engine Capacity, and Model Year") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black", size = 12)) #+ scale_color_brewer(palette = "Dark2")

Boxplot: Car Model VS Price, by Transmission Type, Engine Capacity, and Model Year

It would appear that the price difference between a 2014 and a 2015 1.5L Alza on a manual transmission is somewhat lower compared to an automatic with the same details. However, there are noticeable differences in the case of the Myvi.

Last but not least, to satisfy my own curiosity, I was interested to see what the plot looked like when isolating only Myvis of the 2014 and 2015 variety:


simple_sum = select(compiled_fac, -Mileage)
simple_sum = filter(simple_sum, Model == "Myvi", Year %in% c("2015", "2014"))
simple_sum = na.omit(simple_sum)

x = seq(20000, 70000, by = 2000)

ggplot(simple_sum, aes(reorder(Model,Price), Price, fill = Year, color = Car.Type)) + geom_boxplot() + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 2, fill = "red") + 
  facet_wrap(Engine.Cap~Transm, nrow = 2) + scale_y_continuous(breaks = x) + 
  xlab("Model") + ggtitle("Comparing Used VS New Myvi - 1.3L VS 1.5L, 2014 VS 2015, and Manual VS Auto") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black", size = 12)) + 
  scale_fill_brewer(palette = "Dark2") + scale_color_brewer(palette = "Set1")



Comparing Used VS New Myvi - 1.3L VS 1.5L, 2014 VS 2015, and Manual VS Auto

I expected a pretty big price difference between a new 2014 and a used 2014, but I figured the new and used 2015 would not have a gap that is as noticeable.

It's possible to explore and visualize the data in other ways, and likely in ways that are better than mine, since I'm only starting out. It may even be possible to develop a model to determine the price of a used car using the data, but I'm not exactly certain how to go about that. I would need to at least finish the regression models course on Coursera before I could be certain of what I'm doing.
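Just to illustrate the idea (and this is definitely not a serious pricing model), a plain linear regression on the used-car subset would be one starting point, assuming Price and Mileage are already numeric at this stage:

#purely illustrative: regress used-car prices on a few of the scraped variables
used_cars = filter(compiled_fac, Car.Type == "UsedCar")
price_fit = lm(Price ~ Model + Year + Engine.Cap + Transm + Mileage, data = used_cars)
summary(price_fit)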

As always, the data file is below and you can play around with it as much as you’d like. 🙂

Final_CarList_Compiled.csv


Web Scraping: The Sequel | Propwall.my

Alright. Time to take another shot at web scraping. My attempt at scraping data off iBilik.my left me a little frustrated because of how long it took, and also because of how little information I could get out of all the duplicated posts.

I think Propwall.my would be a much better choice considering how their listings are arranged. Propwall is:

…Malaysia’s most ADVANCED property search website that provides marketing and research solutions to property agents, developers, and investors.

The emphasis on “advanced” was not mine. They just seem to be pretty sure of themselves.

I figured that since I have some idea of how to do this now, it would be a lot easier to organize my code more neatly. Well…for me, at least. Plus, I could insert more comments this time.

I picked Mont Kiara as my location and started off the same way, by loading the stringr, rvest, and ggvis packages.

#load libraries
library(stringr)
library(rvest)
library(ggvis)
library(dplyr)
library(ggplot2)

#The first half of the URL..
site_first = "http://www.propwall.my/mont_kiara/classifieds?page="
#...and this is the second half of the URL
site_second = "&amp;amp;amp;tab=Most%20Relevance&amp;amp;amp;keywords=Mont%20Kiara%2C%20Kuala%20Lumpur&amp;amp;amp;filter_id=17&amp;amp;amp;filter_type=Location&amp;amp;amp;listing=For%20Rent&amp;amp;amp;view=list"

#concatenate them together, with the coerced digit in between them. This digit is the page number
siteCom = paste(site_first, as.character(1), site_second, sep = "")
siteLocHTML = html(siteCom)

You might be wondering why I keep loading ggvis if I won't be using its charts. Embarrassingly enough, it's because I keep forgetting which package gives me the piping operator %>%, but I always remember that I can use that operator if ggvis is loaded.
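As a side note, the %>% operator actually comes from the magrittr package, and dplyr re-exports it, so ggvis isn't strictly needed for the pipe:

library(magrittr) #this alone makes %>% available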

Similar to what I did in the iBilik.my post, I extracted the information for the first search page, and then rbind-ed the remaining pages, as so:

#Extract the descriptions of the first page...
siteLocHTML %>% html_nodes("h4.media-heading a") %>% 
  html_text() %>% data.frame() -> x

#...and also the links to these postings 
siteLocHTML %>%  html_nodes("#list-content") %>% 
  html_nodes(".media") %>% 
  html_nodes(".media-heading") %>% 
  html_nodes("a") %>%
  html_attr("href") %>% 
  data.frame() -> y

#Since we already have the extractions for the first page, we can now...
#...loop through pages 2 to 250 and rbind them with the page 1 extracts
for(i in 2:250){

  siteCom = paste(site_first, as.character(i), site_second, sep = "")
  siteLocHTML = html(siteCom)
  
siteLocHTML %>% html_nodes("h4.media-heading a") %>% 
  html_text() %>% data.frame() -> x_next

siteLocHTML %>%  html_nodes("#list-content") %>% 
  html_nodes(".media") %>% 
  html_nodes(".media-heading") %>% 
  html_nodes("a") %>%
  html_attr("href") %>% 
  data.frame() -> y_next

x = rbind(x, x_next)
y = rbind(y, y_next)

}

#column bind the description and links
complete = cbind(x,y)
complete[,2] = as.character(complete[,2])

names(complete) = c("Description", "Link")

#file backup
write.csv(complete, "complete_propwall.csv", row.names = FALSE)

#And remove the remaining dataframes from the environment
rm(x_next, y_next, x, y)

Up to this point, the code has sifted through 250 search pages with 20 listings each, which adds up to 5000 listings. However, I only have the links to the postings, and not information such as rental, layout, etc. In order to have that information, I would need to navigate to each link and get the rental, layout, and date…for 5000 posts.

There was no means by which I could determine whether a post was duplicated or not, so I haven't removed any of the 5000 posts. The only observations that were removed were based on the unavailability of certain information (no rental info given) or the ridiculousness of a rental rate (an RM 1 million rental! Probably meant for the sale section and not the rental section).

Anywho, the next step was to get the post information, and I used the code below:

#empty vectors to collect the scraped fields
price = c()
dates = c()
furnish = c()
layout = c()

for(i in 1:nrow(complete)){
siteLocHTML = html(complete[i,"Link"])

siteLocHTML %>%  html_nodes(".clearfix") %>%
  html_nodes("#price") %>% html_text() %>% c() -> y

#Rentals come out in quadruples, but I need only one
y = unique(y)
price = c(price, y)

siteLocHTML %>% html_nodes("#content") %>% 
  html_nodes("p") %>% html_text() -> z

dates = c(dates, z)

siteLocHTML %>%  html_nodes(".clearfix") %>%
  html_nodes("td") %>% html_text() %>% c() -> a
b = a[8]
a = a[4]

furnish = c(furnish, a)
layout = c(layout, b)

}


#take only the first 4922 rows
final = complete[1:4922,] 

#cbind the new columns
final$price = price
final$furnish = furnish[1:4922]
final$layout = layout[1:4922]
final$posted = dates[1:4922]

#remove NAs
final_fil = na.omit(final)

#remove other NAs as shown in the website
final_fil = final_fil[!final_fil[,4] == "-NA-",]
final_fil = final_fil[!final_fil[,5] == "-NA-",]

#remove rownames
rownames(final_fil) = NULL

final_fil$furnish = as.factor(final_fil$furnish)
final_fil$layout = as.factor(final_fil$layout)

rm(y,z,a,b,price, furnish, layout, dates)

I should mention that I ran into some issues finishing the data extraction up to this point. The issue mainly had to do with the fact that after 4922 posts, I got an error. The error had more to do with navigating to so many pages than with the code itself. With that in mind, I had to limit my sample to 4922 observations.
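If I were to rerun this, one way to keep the loop going past a bad page would be to wrap the html() call at the top of the loop in tryCatch() and skip the failures; a rough sketch of the idea:

#sketch: skip a post if its page can't be fetched, instead of stopping the whole loop
siteLocHTML = tryCatch(html(complete[i, "Link"]), error = function(e) NULL)
if(is.null(siteLocHTML)) next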

Next up, a little string manipulation to get the rental rates into a format more suitable for calculations. I also need to get only the dates from the posting information. I changed the original formatting for rentals and dates…

> "RM 5,200 (RM 3 psf)"
> "Posted by Champion YEE 016-6012080 on 02/11/2015"

…into this…

> 5200
> 02/11/2015

and the code i used was the following script:

#Extracting dates and rentals
x = c()
for(i in 1:nrow(final_fil)){

  x[i] = substring(final_fil[i, "posted"], nchar(final_fil[i, "posted"])-9, nchar(final_fil[i, "posted"]))

}

rm(complete, final,a,b,furnish, i, layout, price)

final_fil$date = x

x = c()

for(i in 1:nrow(final_fil)){
  
y = gregexpr("\\(",final_fil[i,"price"])[[1]][1]

x[i] = substring(final_fil[i,"price"], 4, y-2)
  
}
x = as.integer(str_replace_all(x, ",", ""))

final_fil$rental = x 
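
Now that the numeric rental column exists, the ridiculous rentals mentioned earlier can be dropped as well; the cutoff below is an arbitrary one, just to show the idea:

#drop the obviously misplaced for-sale prices (cutoff is arbitrary)
final_fil = filter(final_fil, rental < 100000)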

And last but not least, the name of the residency. I already have the names of the residencies written as, for instance, "Vista Kiara, Mont Kiara". What I didn't want was the "Mont Kiara" bit, so I added another column, residency, that holds only the names:

#Extract only the name of the residency
final_fil$residency = str_replace_all(final_fil[,1], ", Mont Kiara", "")

Anyone who uses R will tell you that dplyr is awesome. It's just…just sooo nice. Although I'm still trying to remember everything it can do, what follows is perhaps a very good example of dplyr's awesomeness.

Since I now have the residency names, the rentals, and the layouts, I can produce a single dataframe that shows the average rental for each layout, for each residency…and all that with one line of dplyr magic:

Mkiara_data = data.frame(summarise(group_by(final_fil, residency, layout), Average_Rental = mean(rental)))

And so if we take a look at the last 6 rows of this data frame, you’ll see:

tail(Mkiara_data)

residency layout Average_Rental
162 i-Zen Kiara I Studio 9000.000
163 i-Zen Kiara II 1-Bedroom 6165.714
164 i-Zen Kiara II 2-Bedroom 3600.000
165 i-Zen Kiara II 3-Bedroom 5450.000
166 i-Zen Kiara II 4-Bedroom 5943.750
167 i-Zen Kiara II 5-Bedroom 6668.750

Hadley Wickham should win an award just for developing that.

Now it's time to look at the data through some graphs. Using the final_fil dataframe, I can build a table that counts the number of times each residency name appears, and use that to plot a graph.

#Number of posts per residency
x = data.frame(table(final_fil$residency)) 
names(x) = c("Residency", "Number_of_Posts")  

ggplot(x, aes(x=reorder(Residency,Number_of_Posts), y=Number_of_Posts)) + 
  geom_bar(stat="identity", color="white") + 
  xlab("Residency") + ylab("Number of posts") + coord_flip() + 
  ggtitle("Number of posts per residency") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Red"))

The resulting graph will look like this:

Mont Kiara - Number of posts per residency


I would like to think that the higher the number of posts, the less popular the residency is. Obviously, there are some arguments against that, but using that understanding we can see that there are very few posts for places such as The Residence and Aman Kiara, while places such as Vista Kiara and Ceriaan Kiara have a lot of rentals available. I'm no property guy, but I'm guessing there are plenty more factors at play here, such as when the property was launched, the type of residency (condo, house, etc.), the neighborhood (metro, suburban), and so on. I'll let you figure that one out.

Perhaps we need some more information, such as what proportion of these listings falls under each level of furnishing.

#Create table for residency and furnishing type
z = data.frame(table(final_fil[,c("residency", "furnish")]))
z = z[z[,"Freq"] != 0,]
names(z) = c("residency", "furnishing", "Posts")
rownames(z) = NULL

#Plot
ggplot(z, aes(x=reorder(residency,Posts), y=Posts, fill=furnishing)) + 
  geom_bar(stat="identity") + coord_flip() + 
  scale_fill_brewer(palette="Dark2") + 
  xlab("Residency") + ggtitle("Number of posts per residency, with furnishing type") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black"))

Listings by furnishing type

I think it's safe to say that very few, if any, of the listings have been labeled as unfurnished. I think that's mainly because many of the owners/agents that post on these sorts of sites classify a unit as semi-furnished if it only has things such as a fan, an air-conditioner, and a kitchen cabinet; but, ironically enough, no actual furniture. From my experience, what is considered unfurnished in Malaysia is, quite literally, an empty unit…with wires still hanging off the ceiling and walls.

I thought it would also be interesting to see how the layouts are distributed in all of these listings.

#Create table for residency and layout columns
y = data.frame(table(final_fil[,c("residency", "layout")]))
y = y[y[,"Freq"] != 0,]
names(y) = c("residency", "layout", "Posts")
rownames(y) = NULL

#Plot
ggplot(y, aes(x=residency, y=Posts, fill=layout)) + 
  geom_bar(stat="identity") + coord_flip() + 
  scale_fill_brewer(palette="Dark2") + 
  xlab("Residency") + ggtitle("Number of posts per residency, with layouts") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black"))

And that gives us…

Number of posts, including layouts

I understand that it would've been better if the bars were in descending order, but I just couldn't figure out how to do that. The usual reordering inside aes() did not work, and I'm still trying to understand why. So for now, ggplot2's usual alphabetical order will have to suffice.
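One thing that does seem to work is reordering the residency factor on the dataframe itself before plotting, ordering by the total number of posts rather than reorder()'s default mean (which is probably why the in-aes version behaved oddly); a sketch using the y table from above:

#order the residency factor by its total number of posts across all layouts
y$residency = reorder(y$residency, y$Posts, FUN = sum)

#then re-run the same ggplot call; the bars come out ordered by total posts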

As you can clearly see, a big portion of all posts is categorized under 3-bedrooms, the most notable of which is Vista Kiara which, if you recall, had the highest number of rental listings in our sample.

And this finally brings us to the numbers that matter: the rental rates. All numbers here are shown as rental per month. The following code determines the average rental per layout type in Mont Kiara. After having inspected the dataframes, it looks like there are plenty of outliers in each category. With that in mind, I think a boxplot would be best. Plus, it will be great to see how much it would cost, on average, to rent a unit of a particular layout in Mont Kiara.

Breaks= c()
for(i in 1:30){Breaks[i] = i*1000}

ggplot(final_fil, aes(x=layout, y=rental)) + geom_boxplot() + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 3, fill = "red") + 
  xlab("Layout") + ggtitle("Distribution of rentals per layout") + ylab("Rentals") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black")) +
  scale_y_continuous(breaks=Breaks)

Boxplot, by unit layout

There seem to be a whole bunch of outliers in the 1-bedroom category, which explains why the mean (the red dot) is not so close to the median (the horizontal line in the box). The median would indicate that the average rental for a 1-bedroom might be somewhere around RM 3500. That conclusion seems to be supported by the Studio category, since many people consider those two groups one and the same.

It's also worth noting that there doesn't seem to be much difference between the 1-bedroom and the 2-bedroom categories, while the 8-bedroom category seems to come from a single data point. I'm not exactly certain what the difference is between a 4-bedroom and a 4.1-bedroom, but if the means are to be used, it'll cost you about another RM 1000 in rent. The 3-bedroom, the category that seems to be readily available as we've seen in the previous charts, goes from 4500 to 5000 ringgit. I would think that's still pretty steep, even if most of the listings are fully furnished.

I was hoping to end this post with a chart that shows the average rental for each residency and each layout, but I'm worried that it would end up as a lot of "visual noise". Besides, I still need to figure out how to arrange the bars in ggplot2 when I'm using 3 variables, of which 2 are categorical. The best that I could come up with is using the facet_grid() function in ggplot2. And so here they are, two at a time:
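A sketch of the kind of facet_grid() call that produces one of those pairs, using the Mkiara_data summary from earlier (the other pairs would just swap the layouts in the filter):

ggplot(filter(Mkiara_data, layout %in% c("1-Bedroom", "Studio")), 
       aes(x = residency, y = Average_Rental)) + 
  geom_bar(stat = "identity", fill = "lightblue") + coord_flip() + 
  facet_grid(. ~ layout) + 
  xlab("Residency") + ylab("Average Rental") + 
  ggtitle("Average rental per residency: 1-Bedroom and Studio")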

1-Bedroom and Studio

2-Bedroom and 3-Bedroom

4-Bedroom and 5-Bedroom

6-Bedroom

I haven't plotted the charts for the 4.1-bedroom and 6+ bedroom categories because they come from very few observations, so there really was no need to plot them.

As always, below are all the data I worked with.

Propwall_link, Propwall_link_details_edited, Propwall_averages.
