List Crawl

How to crawl in list mode


By default Sitebulb is set to 'crawl' a website from the start URL you supply - which means finding all the links on the page and following these in turn. However, in some circumstances you may wish to only crawl a specific set of URLs on a website without following the links.
In such cases, you would need to utilise Sitebulb's 'list mode'.
When you first open up the audit setup screen, the left-hand side will have 'Audit Data' selected, and the options will show on the right-hand side. There are lots of Audit Data options, so we have a dedicated documentation page for this.
Select the 'Crawl Sources' option from the left hand menu. This is the only audit setting in Sitebulb that is not optional. Sitebulb needs at least one crawl source, otherwise it cannot crawl!
The default setting is for Sitebulb to crawl the website, so the 'Crawl Website' option will always be ticked to begin with. However, it can also be configured to crawl XML Sitemap URLs and/or a provided URL List.
To get Sitebulb to 'crawl' based on a list, check the 'URL List' option. To add a URL List, simply upload a .csv or .txt file from your local computer.
This isn't strictly crawling, as links from the pages will not be followed, but the data will be collected and analysed for every URL contained in the list. Typically, URL Lists are used when you DON'T also crawl the website - for example, to audit a specific area or section of the site. If you wish to do this, make sure to uncheck the 'Crawl Website' option.
Please note that Sitebulb will only crawl URLs that match the subdomain of the start URL provided (so you can't just upload a massive list of URLs from lots of different sites).
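If your raw export mixes hosts or contains duplicates, it can be worth tidying the file before uploading it. Below is a minimal Python sketch that de-duplicates a list and keeps only URLs on the same subdomain as the start URL; the file names and start URL are placeholders, and the filtering simply mirrors the restriction described above rather than anything Sitebulb itself does.

```python
from urllib.parse import urlparse

START_URL = "https://www.example.com/"   # placeholder start URL
start_host = urlparse(START_URL).netloc.lower()

seen = set()
kept = []
with open("raw-urls.txt", encoding="utf-8") as fh:      # placeholder input file
    for line in fh:
        url = line.strip()
        if not url or urlparse(url).netloc.lower() != start_host:
            continue                                     # drop blanks and other hosts
        if url not in seen:
            seen.add(url)
            kept.append(url)

with open("url-list.txt", "w", encoding="utf-8") as fh:  # upload this file as the URL List
    fh.write("\n".join(kept) + "\n")
```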


Crawl URL Lists (Visual SEO Studio)

Lower pane tabs

Output
Histogram
HTTP issues
Non-crawled items
Search Results
Page Links
Page Images
hreflang Viewer
Page language directives
H1-H6
SERP preview
Validation
ViewState viewer
Page CSS/JS files


Right pane tabs

Progress
Session
Properties
Content
DOM
Plain text
Screenshot



This feature permits crawling URL lists from various domains in order to audit a site's backlink profile.

You can import backlink URLs from all major backlink intelligence providers; the program recognizes their proprietary CSV formats.
You can import URLs from multiple sources; the original lists will be merged and duplicates discarded.

To learn more about the feature, please read the Crawl URL Lists: off-site analysis page.
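If you want to inspect the combined list before importing it, the merge-and-deduplicate step is easy to reproduce outside the program. A rough sketch, assuming each export has already been reduced to one URL per line (the file names are hypothetical):

```python
# Merge several one-URL-per-line exports, discarding duplicates while
# preserving the order in which each URL is first seen.
export_files = ["provider-a.txt", "provider-b.txt", "provider-c.txt"]  # hypothetical files

merged = dict.fromkeys(
    line.strip()
    for name in export_files
    for line in open(name, encoding="utf-8")
    if line.strip()
)
print(f"{len(merged)} unique URLs after merging {len(export_files)} sources")
```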

URLs can be imported from the clipboard, if you have copied text containing a list of URLs.
Clicking on the Add from Clipboard button will import the copied URLs. Text rows not recognized as URLs will be skipped.
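As an illustration of how rows that are not URLs can be skipped, a very loose check is to require an http(s) scheme plus a host name. This is only a guess at the kind of filter involved, not the program's actual rule:

```python
from urllib.parse import urlparse

def looks_like_url(row: str) -> bool:
    """Very loose check: http(s) scheme plus a non-empty host."""
    parsed = urlparse(row.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

clipboard_text = """https://example.com/page-1
random note that is not a URL
https://example.com/page-2"""

urls = [row for row in clipboard_text.splitlines() if looks_like_url(row)]
print(urls)   # the middle row is skipped
```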


You can import URLs from CSV files exported from Google Search Console, Bing Webmaster Tools, Yandex.Webmaster, and all major backlink intelligence providers.
Clicking on the Add from CSV file... button will expand the window to let you choose the desired CSV format, and then open a dialog to select the CSV file and preview the URLs to import.
If you are not sure which CSV format you have, don't worry: you will be able to test the different import schemes and change the one to use before importing the URLs.
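Each provider lays out its CSV differently, which is why the dialog lets you pick and test the import scheme. If you want to sanity-check an export yourself, one simple heuristic is to look for the column whose values most often parse as URLs; this is a generic sketch with a hypothetical file name, not the program's detection logic:

```python
import csv
from urllib.parse import urlparse

def is_url(value: str) -> bool:
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

with open("backlinks-export.csv", newline="", encoding="utf-8") as fh:
    rows = list(csv.reader(fh))

header, data = rows[0], rows[1:]
# Score each column by how many of its values look like URLs.
scores = [sum(is_url(row[i]) for row in data if i < len(row)) for i in range(len(header))]
url_column = scores.index(max(scores))
print(f"The URL column appears to be {header[url_column]!r}")
```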


When crawling a list to audit a backlink profile, it is highly recommended to specify which domain the analysis refers to.
This way the spider will also crawl the destination URLs when it finds links pointing to that domain.
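In other words, a link discovered on a page in the list is only followed when its destination belongs to the domain you are auditing. A small sketch of that rule (the helper name and domain are illustrative, not the spider's actual code):

```python
from urllib.parse import urlparse

TARGET_DOMAIN = "example.com"   # hypothetical domain the analysis refers to

def should_crawl_destination(link_url: str) -> bool:
    """Follow the link only if it points to the audited domain or one of its subdomains."""
    host = urlparse(link_url).netloc.lower()
    return host == TARGET_DOMAIN or host.endswith("." + TARGET_DOMAIN)

print(should_crawl_destination("https://www.example.com/page"))  # True
print(should_crawl_destination("https://other-site.net/page"))   # False
```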


The total number of unique URLs imported into the list.
This number could be less than the number in the original list, because duplicates are removed.

You can give your crawl session an optional descriptive name for your own convenience; you can also add or change it at a later time.
This tab sheet lists all imported URLs to be crawled.
Column holding the imported URLs of the pages to crawl.

This tab sheet lists all distinct domain names, extracted from the imported URLs.
Note: the list can be exported to Excel/CSV.

The number of URLs in the list belonging to the domain.

This tab sheet lists all distinct sub-domain names, extracted from the imported URLs.
Note: the list can be exported to Excel/CSV.

The number of URLs in the list belonging to the sub-domain.
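Both tab sheets amount to grouping the imported URLs by host name and counting them. A quick approximation with the standard library is shown below; note that a real implementation would use a public-suffix list to separate the registrable domain from the sub-domain, while this sketch naively takes the last two host labels:

```python
from collections import Counter
from urllib.parse import urlparse

imported_urls = [
    "https://blog.example.com/post-1",
    "https://blog.example.com/post-2",
    "https://www.example.com/",
    "https://news.sample.org/item",
]

# URL counts per sub-domain (full host name).
subdomain_counts = Counter(urlparse(u).netloc.lower() for u in imported_urls)

# Naive "domain": last two host labels; a public-suffix list would be more accurate.
domain_counts = Counter(".".join(host.split(".")[-2:])
                        for host in subdomain_counts.elements())

print(domain_counts.most_common())     # e.g. [('example.com', 3), ('sample.org', 1)]
print(subdomain_counts.most_common())
```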
Clicking on the Show options link will expand the window to let you access further crawl parameters.

When auditing page contents from an SEO perspective, you normally only care about pages returning a 200 OK status code, because only such pages can be indexed by search engines.
Nevertheless, there are various reasons to wish to analyze the other pages anyway: checking for the Analytics tracking code, checking whether the page is user-friendly, and so on.
We suggest disabling the option only in those rare cases when you need to explore huge sites with many HTTP errors and need to save as much disk space as possible.

Selecting this option will make the spider ignore the Disallow: directives read in the robots.txt file that would normally prevent visiting some website paths.

According to the original robots.txt specification, a missing file (404 or 410) should be interpreted as "allow everything", and all other status codes should be interpreted as "disallow everything".
Google made the questionable choice to also treat some status codes, such as 401 "Unauthorized" and 403 "Forbidden", as "allow everything", even though semantically they would mean the contrary.
To be able to reproduce Google's behavior we added this option, which is not selected by default.


According to the original robots.txt specification, a missing file (404 or 410) should be interpreted as "allow everything", and all other status codes should be interpreted as "disallow everything".
A redirection should thus be interpreted as "disallow everything"; unfortunately, it is not rare for a site to redirect missing files to the root address (i.e. to the Home Page) with a generic rule, which then applies to a missing robots.txt as well. It is a disputable practice (Google, for example, treats generic redirections to the Home Page as "soft 404s"), but common enough that Google chose to tolerate this specific case and interpret it like a 404 (besides, the webmaster's intention is respected this way).
To be able to reproduce Google's behavior we added this option, which is not selected by default.


According to the original robots.txt specification, a missing file (404 or 410) should be interpreted as "allow everything", and all other status codes should be interpreted as "disallow everything".
A redirection should thus be interpreted as "disallow everything"; unfortunately, in scenarios like an HTTP-to-HTTPS migration or a domain name change it is common to redirect everything from the old version to the new one, robots.txt included.
To permit auditing a site after an HTTP-to-HTTPS migration when the given Start URL uses the http:// protocol (or the protocol is not specified and http:// is assumed), we added this option, which is selected by default.
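The options above that deal with the robots.txt fetch status together define how the result is mapped to an "allow everything" / "disallow everything" decision. The sketch below condenses that decision table into one function, with the options as plain boolean flags; it illustrates the rules as described and is not Visual SEO Studio's actual code:

```python
def robots_txt_policy(status: int,
                      redirects_to_https_same_host: bool = False,
                      redirects_to_home_page: bool = False,
                      allow_on_401_403: bool = False,           # option, off by default
                      tolerate_home_redirect: bool = False,      # option, off by default
                      follow_https_redirect: bool = True) -> str:  # option, on by default
    """Map a robots.txt fetch result to a crawl policy, per the rules described above."""
    if status == 200:
        return "parse rules"
    if status in (404, 410):
        return "allow everything"          # missing file: original specification
    if status in (401, 403):
        return "allow everything" if allow_on_401_403 else "disallow everything"
    if 300 <= status < 400:
        if redirects_to_https_same_host and follow_https_redirect:
            return "follow redirect"       # e.g. after an HTTP-to-HTTPS migration
        if redirects_to_home_page and tolerate_home_redirect:
            return "allow everything"      # tolerated and treated like a 404
        return "disallow everything"
    return "disallow everything"           # any other status: original specification


print(robots_txt_policy(403))                                      # 'disallow everything'
print(robots_txt_policy(403, allow_on_401_403=True))               # 'allow everything'
print(robots_txt_policy(301, redirects_to_https_same_host=True))   # 'follow redirect'
```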


SEO spiders try to speed up website visits by using multiple concurrent HTTP connections, i.e. requesting several web pages at the same time.
Visual SEO Studio does the same, even if its adaptive crawl engine can decide to push less when it detects that the web server would get overloaded.
This control lets you tell the spider how much harder it can push if the web server keeps responding quickly.


The maximum limit is 5. The free Community Edition can use at most 2 concurrent connections.
Warning: increasing the number of threads could slow down or hang the server if it cannot keep up with the requests; do it at your own risk (that's why you can force more on verified sites only).
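The connection cap is the classic bounded-concurrency pattern. The asyncio sketch below shows the idea with a semaphore; it is a generic illustration, not the program's adaptive engine (which also backs off when the server slows down):

```python
import asyncio
import urllib.request

MAX_CONNECTIONS = 2   # e.g. the Community Edition cap mentioned above

async def fetch_status(url: str, limiter: asyncio.Semaphore) -> int:
    async with limiter:   # never more than MAX_CONNECTIONS requests in flight
        return await asyncio.to_thread(
            lambda: urllib.request.urlopen(url, timeout=10).status
        )

async def crawl(urls: list[str]) -> None:
    limiter = asyncio.Semaphore(MAX_CONNECTIONS)
    statuses = await asyncio.gather(*(fetch_status(u, limiter) for u in urls))
    for url, status in zip(urls, statuses):
        print(status, url)

# asyncio.run(crawl(["https://example.com/", "https://example.com/about"]))
```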


Crawl rules in SharePoint Server

APPLIES TO: 2013 | 2016 | 2019 | Subscription Edition | SharePoint in Microsoft 365
You can add a crawl rule to include or exclude specific paths when you crawl content. When you include a path, you can provide alternative account credentials to crawl it. In addition to creating or editing crawl rules, you can test, delete, or reorder existing crawl rules.
Use crawl rules to do the following:
Prevent content on a site from being crawled. For example, if you created a content source to crawl 'http://www.contoso.com', but you do not want the search system to crawl content from the subdirectory 'http://www.contoso.com/downloads', create a crawl rule to exclude content from that subdirectory.
Crawl content on a site that would be excluded otherwise. For example, if you excluded content from 'http://www.contoso.com/downloads' from being crawled, but you want content in the subdirectory 'http://www.contoso.com/downloads/content' to be crawled, create a crawl rule to include content from that subdirectory.
Specify authentication credentials. If a site to be crawled requires different credentials than those of the default content access account, create a crawl rule to specify the authentication credentials.
You can use the asterisk (*) as a wildcard character in crawl rules. For example, to exclude JPEG files from crawls on 'http://www.contoso.com', create a crawl rule to exclude 'http://www.contoso.com/*.jpg'.
The order of crawl rules is important, because the first rule that matches a particular set of content is the one that is applied.
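Because the first matching rule wins, the same set of rules can behave very differently depending on their order. The sketch below mimics that first-match evaluation with shell-style wildcards, reusing the contoso.com examples from above; it illustrates the described behaviour and is not how SharePoint evaluates rules internally:

```python
from fnmatch import fnmatch

# (pattern, include) pairs, evaluated top to bottom; the first match wins.
crawl_rules = [
    ("http://www.contoso.com/*.jpg", False),               # exclude JPEG files
    ("http://www.contoso.com/downloads/content*", True),   # re-include this subdirectory
    ("http://www.contoso.com/downloads*", False),          # exclude the rest of /downloads
    ("http://www.contoso.com/*", True),                    # include everything else on the site
]

def is_crawled(url: str) -> bool:
    for pattern, include in crawl_rules:
        if fnmatch(url, pattern):
            return include
    return True   # no rule matched; assume the content source default applies

print(is_crawled("http://www.contoso.com/downloads/content/page.aspx"))  # True
print(is_crawled("http://www.contoso.com/downloads/setup.exe"))          # False
print(is_crawled("http://www.contoso.com/images/photo.jpg"))             # False
```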
To create or edit a crawl rule

Verify that the user account that is performing this procedure is an administrator for the Search service application.
In Central Administration, in the Application Management section, click Manage Service Applications.
On the Manage Service Applications page, in the list of service applications, click the Search service application.
On the Search Administration page, in the Crawling section, click Crawl Rules. The Manage Crawl Rules page appears.
To create a new crawl rule, click New Crawl Rule. To edit an existing crawl rule, in the list of crawl rules, point to the name of the crawl rule that you want to edit, click the arrow that appears, and then click Edit.
On the Add Crawl Rule page, in the Path section:
In the Path box, type the path to which the crawl rule will apply. You can use standard wildcard characters in the path.
To use regular expressions instead of wildcard characters, select Use regular expression syntax for matching this rule.
In the Crawl Configuration section, select one of the following:
Exclude all items in this path. Select this option if you want to exclude all items in the specified path from crawls. If you select this option, you can refine the exclusion by selecting Exclude complex URLs (URLs that contain question marks (?)) to exclude URLs that contain parameters that use the question mark (?) notation.
Include all items in this path. Select this option if you want all items in the path to be crawled. If you select this option, you can further refine the inclusion by selecting any combination of these options:
Follow links on the URL without crawling the URL itself. Select this option if you want to crawl links contained within the URL, but not the starting URL itself.
Crawl complex URLs (URLs that contain a question mark (?)). Select this option if you want to crawl URLs that contain parameters that use the question mark (?) notation.
Crawl SharePoint Server content as http pages. Normally, SharePoint Server sites are crawled by using a special protocol. Select this option if you want SharePoint Server sites to be crawled as HTTP pages instead. When the content is crawled by using the HTTP protocol, item permissions are not stored.
In the Specify Authentication section, perform one of the following actions:
Note: the following options are not available unless the Include all items in this path option is selected in the Crawl Configuration section.
To use the default content access account, select Use the default content access account.
If you want to use a different account, select Specify a different content access account and then in the Account box, type the user account name that can access the paths that are defined in this crawl rule. Next, in the Password and Confirm Password boxes, type the password for this user account. To prevent basic authentication from being used, select the Do not allow Basic Authentication check box. The server attempts to use NTLM authentication. If NTLM authentication fails, the server attempts to use basic authentication unless the Do not allow Basic Authentication check box is selected.
To use a client certificate for authentication, select Specify client certificate, expand the Certificate menu, and then select a certificate.
To use form credentials for authentication, select Specify form credentials, type the form URL (the location of the page that accepts credentials information) in the Form URL box, and then click Enter Credentials. When the logon prompt from the remote server opens in a new window, type the form credentials with which you want to log on. You are prompted if the logon was successful. If the logon was successful, the credentials that are required for authentication are stored on the remote site.
To use cookies, select Use cookie for crawling, and then select Obtain cookie from a URL to obtain a cookie from a website or server. Or, select Specify cookie for crawling to import a cookie from your local file system or a file share. You can optionally specify error pages in the Error pages (semi-colon delimited) box.
To allow anonymous access, select Anonymous access.
To test a crawl rule

Verify that the user account that is performing this procedure is an administrator for the Search service application.
In Central Administration, in the Application Management section, click Manage Service Applications.
On the Manage Service Applications page, in the list of service applications, click the Search service application.
On the Search Administration page, in the Crawling section, click Crawl Rules.
On the Manage Crawl Rules page, in the Type a URL and click test to find out if it matches a rule box, type the URL that you want to test.
Click Test. The result of the test appears below the Type a URL and click test to find out if it matches a rule box.
To delete a crawl rule

Verify that the user account that is performing this procedure is an administrator for the Search service application.
In Central Administration, in the Application Management section, click Manage Service Applications.
On the Manage Service Applications page, in the list of service applications, click the Search service application.
On the Search Administration page, in the Crawling section, click Crawl Rules.
On the Manage Crawl Rules page, in the list of crawl rules, point to the name of the crawl rule that you want to delete, click the arrow that appears, and then click Delete.
Click OK to confirm that you want to delete this crawl rule.
To reorder crawl rules

Verify that the user account that is performing this procedure is an administrator for the Search service application.
In Central Administration, in the Application Management section, click Manage Service Applications.
On the Manage Service Applications page, in the list of service applications, click the Search service application.
On the Search Administration page, in the Crawling section, click Crawl Rules.
On the Manage Crawl Rules page, in the list of crawl rules, in the Order column, specify the crawl rule position that you want the rule to occupy. Other values shift accordingly.
