How to Find All Present and Archived URLs on a Website

There are many reasons you might need to find every URL on a website, and your exact goal will determine what you're looking for. For example, you may want to:

Find every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each case, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through several tools for building your URL list and then deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is an invaluable, donation-funded tool for SEO tasks. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.

To work around the missing export button, use a browser scraping plugin like Dataminer.io. Even so, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
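If you're comfortable with a little scripting, you can also query the Wayback Machine's CDX API directly instead of scraping the UI. Here's a minimal sketch: the endpoint and parameters (`url`, `fl`, `collapse`, `output`) come from Archive.org's CDX documentation, while the domain is a placeholder you'd swap for your own.

```python
"""Sketch: pull archived URLs for a domain via the Wayback Machine's CDX API.

The endpoint and parameters are from Archive.org's CDX API docs; the domain
passed in is a placeholder for the site you're auditing."""
from urllib.parse import urlencode
from urllib.request import urlopen
import json

def cdx_query_url(domain: str, limit: int = 10000) -> str:
    """Build a CDX API request for every captured URL under a domain."""
    params = urlencode({
        "url": f"{domain}/*",   # match all paths on the domain
        "output": "json",       # rows of JSON arrays instead of plain text
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # one row per unique URL
        "limit": limit,
    })
    return f"https://web.archive.org/cdx/search/cdx?{params}"

def fetch_archived_urls(domain: str) -> list[str]:
    """Fetch and flatten the CDX response (the first row is a header)."""
    with urlopen(cdx_query_url(domain)) as resp:
        rows = json.load(resp)
    return [row[0] for row in rows[1:]]  # skip the ["original"] header row
```

Unlike the web UI, the API lets you raise the limit or page through results, though very large domains may still require batching the requests.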

Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from the site. If you're dealing with a large website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, because most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method usually works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several useful sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export itself is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
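Getting past the export cap with the Search Analytics API comes down to paging with `startRow` and `rowLimit`. The sketch below models just that loop; `query_fn` is a stand-in for an authenticated call to the API's `searchanalytics.query` method, and the dates are placeholders for your audit window.

```python
"""Sketch: page through Search Analytics API results to collect page URLs.

`query_fn` is a placeholder for an authenticated searchanalytics.query call;
only the pagination logic is modeled here."""

def collect_pages(query_fn, row_limit: int = 25000) -> list[str]:
    """Request batches of `row_limit` rows until the API returns a short batch."""
    pages, start_row = [], 0
    while True:
        body = {
            "startDate": "2024-01-01",  # placeholder: your audit window
            "endDate": "2024-12-31",
            "dimensions": ["page"],     # one row per URL
            "rowLimit": row_limit,
            "startRow": start_row,
        }
        rows = query_fn(body).get("rows", [])
        pages.extend(row["keys"][0] for row in rows)
        if len(rows) < row_limit:       # short batch means we've hit the end
            return pages
        start_row += row_limit
```

In practice you would pass in a small wrapper around the Google API client; the loop itself stays the same.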

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create specific URL lists, effectively getting around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still offer valuable insights.
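The same /blog/ narrowing can be applied after export as well. This sketch filters a GA4 page-path export and converts paths to absolute URLs for later merging; the sample paths and `example.com` host are illustrative, since GA4 reports paths rather than full URLs.

```python
"""Sketch: narrow an exported GA4 page-path list to one site section.

Sample paths and the host are placeholders; GA4 exports page paths, so a
hostname is prepended before merging with URL lists from other sources."""

def filter_section(paths: list[str], pattern: str = "/blog/") -> list[str]:
    """Keep only paths that fall under the given URL section."""
    return [p for p in paths if pattern in p]

def to_urls(paths: list[str], host: str = "https://example.com") -> list[str]:
    """Turn GA4 page paths into absolute URLs for later deduplication."""
    return [host + p for p in paths]
```

Splitting the export into per-section batches like this is also how you stay under the 100k row limit on very large sites.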

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Challenges:

Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
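If your logs are in the standard common/combined format, extracting the requested paths needs only a few lines. This sketch assumes the usual `"GET /path HTTP/1.1" status` request layout; adjust the regex for your server or CDN's actual format.

```python
"""Sketch: pull requested paths and status codes from access logs.

Assumes the common/combined log format's '"GET /path HTTP/1.1" status'
request segment; other log layouts need a different pattern."""
import re

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def paths_from_log(lines):
    """Yield (path, status) for each request line that matches the pattern."""
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            yield m.group("path"), int(m.group("status"))
```

Filtering the output for status 404 gives you exactly the post-migration error list described earlier, and filtering by the Googlebot user agent shows which paths Google actually requested.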
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
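The "consistently formatted, then deduplicated" step can be sketched like this. The normalization choices here (lowercasing scheme and host, dropping fragments and trailing slashes) are one reasonable convention, not the only one; match them to how your site actually serves URLs.

```python
"""Sketch: normalize and deduplicate URLs merged from all the sources above.

The normalization rules (lowercase scheme/host, strip fragments and trailing
slashes) are an assumed convention; adapt them to your site's URL rules."""
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Apply consistent formatting so near-duplicates collapse together."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.query, ""))  # empty last field drops the #fragment

def dedupe(urls):
    """Normalize, then keep the first occurrence of each URL."""
    seen, out = set(), []
    for u in urls:
        n = normalize(u)
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```

For very large lists, loading the combined exports into a pandas DataFrame and using its drop-duplicates support achieves the same result inside a notebook.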

And voilà, you now have a comprehensive list of current, former, and archived URLs. Good luck!

