Over the past 8 weeks, I have been crawling dozens of ecommerce websites across different industries with one goal in mind - create beautiful visualizations of website structures to understand what is required to create organized and navigable structures for both humans and search engines.
The first issue I had to sort out was the software/hardware stack; both were hopelessly inept for the mission. It was time for an enterprise-level upgrade.
Here's the stack I used:
- 24-core, 64GB RAM, 1TB SSD Google VM
- Gephi for Network Analysis
- Dataiku & Python for data cleansing
- Screaming Frog SEO Spider (SFSS)
- Debian 10 OS
- Adobe XD for creating the visuals
One of the biggest blockers for this project was not being able to clean the CSV files generated by SFSS with Excel. An Excel worksheet is capped at 1,048,576 rows, and most of these CSV files had 50M+ rows.
After some sleepless nights and lots of coffee, I stumbled upon Dataiku and it was a lifesaver. With a simple recipe in Dataiku and a little bit of Python, I was able to clean data with 100M+ rows in a single file, with ease.
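For readers without Dataiku, the Python side of this kind of cleanup can be sketched with the standard library alone. This is a minimal sketch, not the recipe I actually used: the function name `clean_inlinks` is hypothetical, and the column names `Type`, `Source` and `Destination` (and the `Hyperlink` type value) are assumptions about the SFSS inlinks export that you should check against your own file. It streams the CSV row by row, so the file never has to fit in memory, and writes a `Source,Target` edge list of the kind Gephi can import.

```python
import csv
import io
from urllib.parse import urldefrag


def clean_inlinks(rows_in, rows_out, link_type="Hyperlink"):
    """Stream a crawl export and write a Source,Target edge list.

    Assumed input columns (verify against your export): Type, Source, Destination.
    Returns the number of edges written.
    """
    reader = csv.DictReader(rows_in)
    writer = csv.writer(rows_out)
    writer.writerow(["Source", "Target"])
    kept = 0
    for row in reader:
        # Keep only page-to-page links; skip images, scripts, stylesheets, etc.
        if row.get("Type") != link_type:
            continue
        # Drop #fragments so /page and /page#top count as the same node.
        source = urldefrag((row.get("Source") or "").strip()).url
        target = urldefrag((row.get("Destination") or "").strip()).url
        # Skip incomplete rows and self-links.
        if not source or not target or source == target:
            continue
        writer.writerow([source, target])
        kept += 1
    return kept


if __name__ == "__main__":
    sample = io.StringIO(
        "Type,Source,Destination\n"
        "Hyperlink,https://example.com/,https://example.com/shoes#top\n"
        "Image,https://example.com/,https://example.com/logo.png\n"
    )
    result = io.StringIO()
    print(clean_inlinks(sample, result), "edge(s) kept")
```

For a real export, pass file objects opened with `newline=""` instead of the in-memory sample, for example `open("inlinks.csv", newline="", encoding="utf-8")` (a hypothetical file name).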
To begin this laborious pursuit, I handpicked 15 popular ecommerce websites, from small (<10K pages) to Amazon (∞). Not all of them made the cut, though; it turns out you cannot crawl Amazon and BestBuy. Who knew?
It's not for the reasons one might think: there was no IP blocking or crawl-rate limiting (at least not until 10M pages). Their servers are capable of handling 10K concurrent users, so a measly crawler is not going to cause any harm.
The issue is the size of these websites. Amazon, for example, has 119M products and that product page data alone would be >2TB.
If you're a curious ninja looking for crawl data and don't want to start from scratch, my partial crawl file of Amazon should help. It's 10M pages and 96GB.
Then there are websites that restrict crawlers.
This was a bummer; I was really looking forward to visualizing some of my favorites.
With crawl rates down to 2 URLs/s, it would take weeks or even months to fully crawl these sites. I can't hold it against them, though; it's best to keep crawlers in check unless you're packing some serious server stacks.
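To put the "weeks or even months" in numbers, here is a back-of-the-envelope calculation. The helper name `crawl_days` is made up for illustration, and it assumes a constant crawl rate with no retries or pauses, which a real crawl will not quite achieve.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages, urls_per_second):
    """Days needed to fetch `pages` URLs at a steady `urls_per_second`."""
    return pages / urls_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Site sizes mentioned in this post: 500K pages, 10M pages, 119M products.
    for pages in (500_000, 10_000_000, 119_000_000):
        print(f"{pages:>11,} pages at 2 URLs/s: {crawl_days(pages, 2):.1f} days")
```

At 2 URLs/s, a 500K-page site finishes in about three days, but 10M pages takes roughly 58 days.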
Website structures determine how effectively link equity flows to different pages. If the site is not tightly knit, you will end up with orphaned pages, which Google will treat as less important than other pages on the site because fewer internal links point to them.
This is especially challenging for larger ecommerce websites that add new products regularly; on a poorly structured website with lots of categories, new product pages would not receive enough link juice and, as a result, suffer the ranking consequences.
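A toy example makes this concrete. The sketch below counts inbound internal links per page and then runs a simplified PageRank as a stand-in for link equity. To be clear about the assumptions: PageRank here is only a proxy I am swapping in for illustration, not how Google actually scores pages, and the function names, URLs and the tiny five-page site are all made up.

```python
from collections import defaultdict


def inbound_link_counts(edges, pages):
    """Count internal links pointing at each page (self-links ignored)."""
    counts = {page: 0 for page in pages}
    for source, target in edges:
        if source != target and source in counts and target in counts:
            counts[target] += 1
    return counts


def pagerank(edges, pages, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration; a rough proxy for link equity."""
    pages = list(pages)
    known = set(pages)
    n = len(pages)
    out_links = defaultdict(set)
    for source, target in edges:
        if source != target and source in known and target in known:
            out_links[source].add(target)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over the site.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = out_links[source]
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    pages = ["/", "/category", "/product-a", "/product-b", "/new-product"]
    edges = [
        ("/", "/category"),
        ("/category", "/"),
        ("/category", "/product-a"),
        ("/category", "/product-b"),
        ("/product-a", "/"),
        ("/product-b", "/"),
        ("/new-product", "/"),  # links out, but nothing links to it
    ]
    counts = inbound_link_counts(edges, pages)
    ranks = pagerank(edges, pages)
    for page in sorted(pages, key=ranks.get, reverse=True):
        print(f"{page:<14} inbound={counts[page]}  rank={ranks[page]:.3f}")
```

In this toy site, `/new-product` has no inbound links, so it ends up with the lowest score, which is the situation a freshly added product page faces on a poorly linked site.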
In retrospect, I should have focussed on small to medium-sized websites and stayed away from 500K+ page websites, since at those levels the complexity of network analysis is insane. For example, visualizing Otto.com with Gephi took 12 hours on a 24-core, 64GB RAM virtual machine, and the structure was still too dense to make any sense of.
For future research projects, I would love to go deeper into this while focussing on fewer websites, and try to understand how site structures vary by platform (Magento, Shopify, Hybris, Salesforce & Bigcommerce) and how that impacts ranking.