On the ZEO Blog, I have been showing what can be done in SEO with the help of R. We have previously covered how to retrieve data from an API with R and how to use it on the Ads side. Today, I will discuss crawling a website, using XPath on the crawled pages, and what you can do when you only want to crawl a sitemap. You can also find the sources of the packages I use at the bottom of the article.
Are you ready to make your work a little easier with R?
With R, you can crawl a website for free and extract the data you want with XPath commands. With the steps I will explain, you can do many of the things that tools like Screaming Frog SEO Spider or Deepcrawl do.
Let's quickly move on to using it. You can find information about the setup in my article where I explain the installation and other settings of RStudio.
First, we install and load the Rcrawler package:
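A minimal setup sketch, assuming the Rcrawler package from CRAN:

install.packages("Rcrawler")  # one-time installation
library(Rcrawler)             # load the package for this session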
Then we start the crawl by entering the site address. How long this takes will vary depending on the size of the site and the power of your computer.
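A sketch of starting the crawl; the site address and the number of cores and connections below are example values to adapt to your own setup:

Rcrawler(Website = "https://zeo.org/", no_cores = 4, no_conn = 4)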
I crawled 150 URLs from ZEO's website to provide examples and to add screenshots to the article quickly. Let's display the crawl results with the code below:
View(INDEX)
All pages on the site are listed together with their status codes and other details. Let's go a little deeper into the data and pull elements like the title and H1 using XPath:
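As a sketch, this extraction can be done by re-running the crawl with Rcrawler's ExtractXpathPat and PatternsNames arguments; the XPath expressions and names below are illustrative examples, not the exact ones used for ZEO's site:

Rcrawler(Website = "https://zeo.org/",
         ExtractXpathPat = c("//title", "//h1"),  # add further expressions as needed
         PatternsNames   = c("Title", "H1"))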
You can write any XPath expressions you want to pull the data you need (such as the description or H2). I also recommend watching the presentation by Mert from our team, where he covers XPath-related topics.
When the crawl is complete, it shows me the data directly:
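In Rcrawler, the extracted values typically end up in a list named DATA in the working environment; a quick way to inspect one page's results:

DATA[[1]]  # extracted fields of the first crawled page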
The data has arrived, but I don't like its form very much; I need to make it a bit easier to work with before visualizing it. To do this, you can create the following data frame:
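One way to collapse the DATA list into a table, assuming each list element holds the extracted fields of one page:

df <- data.frame(do.call(rbind, DATA))  # one row per crawled page
View(df)

With the default column names, the extracted fields appear as X1, X2, and so on, which is where the X7 column mentioned below comes from.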
My data is now more understandable;
In column X7, I can easily understand whether the page type is page-blog or a tool page. This data will vary depending on the codes used in the design of your site.
If I merge the XPath columns with the crawl results, my table becomes a structure that contains more data:
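A sketch of the merge, assuming the crawl index is in INDEX and the XPath data frame from the previous step is df:

d <- cbind(INDEX, df)  # crawl results and extracted fields side by side
View(d)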
When I run the code, my data comes in a merged form;
Before the visualization, instead of explaining to the relevant people what X7 means, let's rename the X7 column right away. Since it is the 17th column of my table, I write 17 in square brackets:
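A sketch of the rename; position 17 assumes the column order of the merged table above, and "PageType" is just an example name:

names(d)[17] <- "PageType"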
Finally, let's come to the data visualization part. I created a data frame named "d" and draw the graph with the following code. This way, I can see which page type takes up more space on my site:
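A sketch of such a chart with ggplot2, assuming the renamed PageType column from the previous step (the original article may have used different plotting code):

library(ggplot2)
ggplot(d, aes(x = PageType)) +
  geom_bar() +                                 # count of URLs per page type
  labs(x = "Page type", y = "Number of URLs")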