Dumb SEO Questions

(This entry was posted by Edvinas Pozniakas in the Dumb SEO Questions community on Facebook, Thursday, April 12, 2018.)

I have a very large website and it needs a lot of sitemaps

Hey,

I have a very large website and it needs a lot of sitemaps referenced from one sitemap.xml file.

1. How many URLs should be in one sitemap? 1,000?
2. What tool should I use (and how) to crawl different parts of the website to get those different sitemaps? I tried to use Screaming Frog, but the website has so many pages that the tool can't crawl all the URLs...
3. Any other tips on generating sitemaps for a large website?


This question begins at 00:51:34 into the clip. Watch this question on YouTube commencing at 00:51:34.

YOUR ANSWERS

Selected answers from the Dumb SEO Questions G+ community.

  • Dave Elliott: Use Screaming Frog, but change its crawl mode from memory to storage. I managed to crawl a 4.5 million URL website using my hard drive the other week!
  • Sunny Ujjawal: All formats limit a single sitemap to 50MB (uncompressed) and 50,000 URLs. If you have a larger file or more URLs, you will have to break your list into multiple sitemaps. You can optionally create a sitemap index file (a file that points to a list of sitemaps) and submit that single index file to Google. You can submit multiple sitemaps and/or sitemap index files to Google.
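The splitting Sunny describes can be sketched in Python; a minimal sketch, in which the `sitemap-N.xml` filenames and the `example.com` base URL are placeholders, not anything the thread specifies:

```python
# Sketch: split a large URL list into 50,000-URL sitemap files plus a
# sitemap index file that points at each of them.
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit from the sitemaps.org protocol


def write_sitemaps(urls, prefix="sitemap"):
    """Write sitemap-1.xml, sitemap-2.xml, ... and return the filenames."""
    filenames = []
    for i in range(0, len(urls), MAX_URLS):
        name = f"{prefix}-{i // MAX_URLS + 1}.xml"
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            for url in urls[i:i + MAX_URLS]:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
        filenames.append(name)
    return filenames


def write_index(filenames, base="https://example.com/", out="sitemap.xml"):
    """Write a sitemap index listing each child sitemap's full URL."""
    with open(out, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for name in filenames:
            f.write(f"  <sitemap><loc>{escape(base + name)}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")
```

The index file is what you would submit once to Google; the child files just need to be reachable at the URLs the index lists.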
  • Alan Bleiweiss: Yeah, and with the 50,000 URL limit, that means your sitemap index can list 50,000 sitemaps.
  • Alan Bleiweiss: Be careful though. Don't generate those sitemaps in real time at file request. No dynamic file creation. That can kill the server. Make them static.
  • Benj Arriola:
    1. I suggest finding ways to do this dynamically so you do not need to update it all the time.
    2. If your site is that large, I would assume it is database-driven in some kind of CMS. Whether it is a popular, commonly used CMS or a custom-built one, either way the data for the URLs should be in the database. It is better to build an XML sitemap directly from the database and not from crawling, especially if the site is very large; crawling slows down the site by consuming too many server resources.
    3. If you are using a common, popular CMS, check if it is extensible, which most modern CMSes are. Find a plugin/addon/module/extension that creates the XML sitemap for you.
    4. If the CMS is not extensible, then get a web developer to pull the links directly from the database.
    5. If you are not using a CMS and it is all static HTML, with no programming and no database at all, then yes, use a crawler like Screaming Frog, Botify, DeepCrawl, or even Xenu to build your XML sitemap. And I highly suggest you start looking into solutions to migrate everything into a CMS.
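Benj's database-driven approach might look like the following minimal Python sketch. The `pages` table and `slug` column are hypothetical; a real site's schema and database engine will differ:

```python
# Sketch: build sitemap XML straight from the database instead of crawling.
# Assumes a hypothetical `pages` table with a `slug` column.
import sqlite3
from xml.sax.saxutils import escape


def sitemap_from_db(db_path, base="https://example.com/"):
    """Return sitemap XML for every row in the `pages` table."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT slug FROM pages ORDER BY slug").fetchall()
    conn.close()
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for (slug,) in rows:
        lines.append(f"  <url><loc>{escape(base + slug)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)
```

Because this reads the canonical URL list from the database, it never loads pages over HTTP, which is why it avoids the server-load problem Benj mentions.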
  • Dawn Anderson: XML-Unlimited Sitemap Generator. It’s like a rocket
  • Dawn Anderson: Edvinas - You don't appear to have even asked what this potential solution is or provides. Here is more information: https://pro-sitemaps.com/
  • Dawn Anderson: or https://www.xml-sitemaps.com/
  • Dawn Anderson: This can either be hosted on your own site or can be hosted by their platform. It costs virtually nothing, to be honest (certainly far less than the cost of a developer's time to build something manually). I have found this a great resource.
  • Bill Hunt: I agree with Benj Arriola that there should be some way to do it in your system. A CMS that large would handle some sort of export. If you use a common format, you can build the list in a database. A key concern should be to get an accurate picture and to remove errors in the files - try to minimize the number of errors to zero, so add some sort of validation. On your second question - the number of URLs - my magic number is 10k. This comes from pushing billions of URLs; we found engines like that number. I tend to break them down by product category, country, etc.
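The per-category breakdown Bill suggests could be sketched by bucketing URLs on their first path segment. This is a simplification, assuming the site encodes category in the URL path; real grouping logic depends on the site's URL scheme:

```python
# Sketch: bucket URLs by their first path segment, so each bucket can
# become its own sitemap file (e.g. sitemap-products.xml).
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls):
    """Map first path segment -> list of URLs, e.g. /products/... -> 'products'."""
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        section = path.split("/")[0] if path else "root"
        groups[section].append(url)
    return dict(groups)
```

Each bucket can then be capped at whatever per-file size you prefer (Bill's 10k, or the protocol's 50k limit).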
  • Dawn Anderson: Isn't that quite a manual method, or are you talking about building something dynamic?
  • Mal Ö Tonge: "httrack" is what I use for large sites; it will run for days if required, and limits can also be set so it does not kill your server. It's also a great way to build a site pre-cache. The best part is it's open source. I've used it for 20 years for all sorts of site scraping and URL list building. Also be careful that your sitemap is not too big; Google only gives you a certain crawl quota. For a site with so many pages it might be better to highlight the most important pages; the last thing you want is your quota being wasted on thin, useless content. Also, with so many pages you need to make sure that the server can cope and is well optimized for speed. I'm not sure if this is a new website or not, but if it is, then I would suggest you introduce content slowly; a thousand-page website appearing overnight will raise a few eyebrows. Mass content does not work anymore; only quality counts now.
  • Benj Arriola: I’ve used WinHTTrack also for a very long time. But I haven’t used WebHTTrack, so I am not sure if WebHTTrack can create an XML sitemap; I believe WinHTTrack cannot. Yes, it can crawl all links. It will make a duplicate copy of the site, but it will not generate an XML sitemap for you that is ready to submit to Google Search Console or Bing Webmaster Tools. WinHTTrack also does not save all original URLs in a single place, so it is not easy to generate an XML sitemap from the saved data. If a URL has parameters, WinHTTrack will rewrite URL paths into static folder paths on your local machine, so if you try to use the folders and files that WinHTTrack has made to create your XML sitemap, they will not be the same. But yes... HTTrack gives you some idea of the structure of your site, the information architecture; a local copy of the site will give you a sitemap, in a way, for you to understand the site. But I assume the question is about creating an XML sitemap.
  • Mal Ö Tonge: Sorry, I should have said: httrack is used to scrape and gather a list of URLs; from the list we can create a sitemap using Notepad and Excel. I'm so old school. If you have root access, then you can scrape the URLs using the Java library jsoup. But as this can put strain on your server and does require extra knowledge of limiting server processes, I would use httrack if all other options fail. The last time I used it for this was a site with 67k products; all plugins failed or timed out due to shared hosting, and the last resort was the trusted httrack to scrape the site and build me a URL list.
  • Gerry White: What is the CMS? I would try to avoid creating the sitemaps by crawling the site if possible and would typically go with a dynamic type of approach - when your website adds a page, it is added to the sitemap. This means that your site will in theory be far easier to crawl. If the site is built on something that has MySQL or MS SQL or something else to power the back end, it would be better and easier to create custom queries, ideally clustering them by page type or category.
  • Edvinas Pozniakas: It's a no-CMS website.
  • Gerry White: Edvinas Pozniakas no database either?
  • Edvinas Pozniakas: Database +
  • Michael Martinez: Edvinas Pozniakas If you have any programming skill, or work with a programmer, download the database and extract the URLs from the data.
  • Edvinas Pozniakas: It's a no-CMS website. I tried Screaming Frog with the storage method, but it needs a powerful computer, so I can't crawl the whole website... What could be the best solution in my case?
  • Mal Ö Tonge: Use httrack to gather the URL list; set the crawl limits and also set the max time to download at 1 second. This should fail the download but will still show the URLs in the list when finished. Then with the URL list you need to organize it into a recognizable structure; you can do this using Excel spreadsheets and delimiters. Once it's all organized, copy it into Notepad and delete the invisible delimiters. It's hard to explain how to do this in words. How many links are there in total? If you get the URL list together using httrack, send me the file and I will convert it for you.
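The list-to-sitemap conversion Mal describes doing by hand in Excel and Notepad can also be scripted. A minimal Python sketch, assuming a plain text file with one URL per line (which is what an HTTrack crawl can be reduced to):

```python
# Sketch: convert a plain-text URL list (one URL per line) into a
# sitemap XML file, skipping blank lines and duplicates.
from xml.sax.saxutils import escape


def urls_to_sitemap(list_path, out_path="sitemap.xml"):
    """Write sitemap XML for each unique URL in list_path; return the count."""
    seen, urls = set(), []
    with open(list_path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")
    return len(urls)
```

For a list larger than 50,000 URLs you would still need to split the output into multiple files, per the protocol limit quoted earlier in the thread.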
  • Benj Arriola: With no CMS and a site that large, you will often run into memory issues if your memory is not that large. Other options you can look into, though they can be more expensive, are DeepCrawl and Botify. A free crawler that runs on Windows is Xenu Link Sleuth. But like Screaming Frog, it's a desktop application, so depending on the computer you are running it on, it can also hit memory issues on a very large site. But Xenu tends to use less memory than Screaming Frog for some reason. It also gathers less information, but it can still easily generate an XML sitemap.
  • Jobin John: You need to watch this video by Google Webmasters. https://www.youtube.com/watch?v=y0TPINzAVf0

View the original question in the Dumb SEO Questions community on Facebook (Thursday, April 12, 2018).
