Congrats on your Swiftype implementation! You’ve already started delivering a powerful site search experience for your visitors. Trust me, they’ll thank you for it.
We’ve been listening to our community and we’re here to answer 4 of the most commonly asked questions from Swiftype customers.
1. Why do my results all look the same?
When Swiftype’s crawler, lovingly called “Swiftbot”, crawls a domain, repetitive template elements can be indexed along with the page body: most notably the navigation/header, sidebars, and footer content. When all the great, meaningful content is encased in template noise, the quality of your visitors’ search experience suffers.
For example, Apple.com includes the same global navigation and footer elements on every page of their website. Since these elements appear on all pages, when the site is indexed, that repeated template text shows up in every document in the Swiftype dashboard.
You can easily clean this up by using our Content Inclusion/Exclusion tag recognition. By adding a Swiftype-specific data attribute (data-swiftype-index="true") to the HTML container(s) that hold the primary page content, you can instruct Swiftbot to index only those sections of the page body.
The best practice is to set the main content container to true. If you want to further refine what’s indexed from that section, you can add additional attributes with a value of false to containers nested within.
<body>
  <nav>Blah blah blah</nav>
  <div id="main_content" data-swiftype-index="true">
    <p>All of my sweet, sweet content is going to go in here.</p>
    <div id="ad_widget" data-swiftype-index="false">This bit isn't as important, which is why it gets a 'false' exclusion attribute.</div>
    <p>This bit will be indexed though, because it's still within the 'main_content' div that's set to 'true'. Everything outside of the 'main_content' div container will be ignored.</p>
  </div>
  <footer>Copyright Attempting to Sound Official © 2016</footer>
</body>
2. How do I prevent Swiftype from indexing certain pages of my site?
For crawler-based engines, there are three approaches you can take to control which pages are indexed from your domains: URL path rules, a customized robots.txt file, and robots meta tags.
Path Rules:
From the Manage > Domains section of the Swiftype customer dashboard, each domain has a ‘Manage Rules’ option. From there, you can define specific paths to include (Whitelist) or exclude (Blacklist) when crawling your site.
Common examples include excluding /category/ paths on ecommerce sites so the crawl focuses exclusively on product pages. On CMS-based sites, you’d typically exclude login and administrative pages, as well as dynamically generated content such as tag or category archive pages; see the sketch below. More examples and tips on using this feature can be found here.
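To make that concrete, here’s a sketch of the kind of rule set a blog might end up with. The exact labels in the dashboard may differ, and the paths are purely illustrative:

Whitelist: /blog/
Blacklist: /blog/tag/
Blacklist: /wp-admin/

With rules like these, Swiftbot would crawl your blog content while skipping tag archives and administrative pages.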
Robots.txt Files:
A robots.txt file is a plain-text document that you upload to the root directory of your website’s domain. With the robots.txt file, you can define URL path exclusion rules for all web crawlers or only specific ones to follow. Many websites already have a robots.txt file in place, and its presence is one of the first things Swiftbot will look for when starting a crawl.
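As a quick sketch, a robots.txt that keeps every crawler out of an admin area while leaving the rest of the site open might look like this (the path is illustrative):

User-agent: *
Disallow: /admin/

Rules under “User-agent: *” apply to all crawlers, Swiftbot included; you can also add a section naming Swiftype’s crawler specifically to give it rules of its own.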
Check out our Robots.txt documentation to learn how you can leverage this with Swiftype.
Robots meta tags:
If you need to exclude content in a more precise way (page by page or page template basis), we recommend and fully support robots meta tags. We adhere to the robots tag standard that’s a companion to the aforementioned robots.txt file.
This means that we’ll pass over any page we attempt to crawl that contains the following meta tag:
<meta name="robots" content="noindex">
You can also configure these tags so they only apply to Swiftype’s web crawler:
<meta name="st:robots" content="noindex">
Similar to the robots.txt file, these meta tags are configured and managed outside of the Swiftype dashboard, through your web host or CMS.
You guessed it. We’ve got documentation on robots meta tag support here.
3. Why are pages missing from my search engine?
Here are 3 reasons why Swiftbot, our crawler, may not be able to locate and index pages on your site:
A. We’re unable to find the content because it’s not linked to from other pages.
When spidering a domain, the crawler examines all links within a page to discover URLs that belong to the domain submitted to the engine and that adhere to any configured path rules (see question 2 above). If content exists on a site but is not linked to from another known page, from the site’s navigation menu, or from the domain’s sitemap, chances are Swiftbot will not be able to locate it.
One of the best ways to ensure our crawler can index all desired content is to provide a current sitemap. Sitemaps are files, typically stored at the root of a domain, that list the pages of a site that are available for crawling; a minimal sketch follows below. Our documentation for sitemap support and installation notes can be found here.
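Here’s what a bare-bones sitemap.xml can look like (the URLs are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
  <url>
    <loc>https://www.example.com/products/blue-widget</loc>
  </url>
</urlset>

Adding a “Sitemap: https://www.example.com/sitemap.xml” line to your robots.txt is a common way to help crawlers find it.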
B. Improperly configured canonical URL elements/tags.
A canonical link element is used much like a meta tag: it prevents duplicate-content indexing issues by pointing web crawlers to the preferred (or canonical) URL version of a web page. A scenario we sometimes see is a site whose content is configured with static canonical tags that all point to the root domain URL.
Due to this misconfiguration, even if the crawler finds links to all of the content on the site, it is being told that every page is just another version of the home page. With that directive in place, only the home page will be indexed. Once those elements are re-configured or removed, a recrawl will index the content successfully.
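To make that concrete (example.com is a placeholder), a product page should point its canonical element at its own preferred URL:

<link rel="canonical" href="https://www.example.com/products/blue-widget">

The misconfiguration described above looks like every page on the site carrying the same element:

<link rel="canonical" href="https://www.example.com/">

which tells any standards-following crawler that each page is merely a duplicate of the home page.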
For best practices on canonical elements, you can refer to Google’s documentation here.
C. The content is being excluded by one of the methods noted in question 2.
Just as with misconfigured canonical link elements, misconfigured or conflicting path rules or robots rules can cause pages to be skipped over.
4. My site is password protected / behind a firewall / hosted on our company’s intranet.
It is possible for Swiftbot to crawl secured content, but you’ll first need to make minor configuration changes to your web or intranet site’s host server.
All Swiftype accounts have an account-specific User-Agent ID string. By whitelisting this identification string with your server, you can grant our crawler, and only our crawler, access to your site’s content.
Swiftype has a unique security feature: we encode our crawler’s User-Agent with a secure key that is uniquely tied to your Swiftype account. This approach lets you limit access solely to Swiftype’s crawler, an extra level of security many customers appreciate.
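As an illustration only, here’s a sketch of how that whitelisting can look with an Apache 2.2-style configuration sitting behind HTTP basic auth. This isn’t an official recipe, and “Swiftbot/1.0 (abc123)” is a made-up placeholder for the real account-specific string you’d get from our team:

# Requests whose User-Agent matches the account-specific string
# bypass basic auth; everyone else must log in.
SetEnvIf User-Agent "Swiftbot/1\.0 \(abc123\)" allow_swiftbot
AuthType Basic
AuthName "Restricted"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
Order allow,deny
Allow from env=allow_swiftbot
Satisfy Any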
If you’re interested in using the Swiftbot web crawler to access your secured content, please contact our support team and we’ll be happy to supply you with your account-specific User-Agent ID string.
Hopefully these answers to commonly asked questions will point you in the right direction. If you ever have questions, suggestions, or feedback, you can always email [email protected] to reach our team. We’re happy to help!