The Swiftype Blog / Month: August 2012

Sitemap.xml Support for Swiftype

At Swiftype we’re always working on new ways to improve the quality of the crawl of your website, and today we’re announcing Swiftype crawler support for the Sitemap.xml protocol.

The Sitemap.xml protocol is a well-documented and widely implemented standard for telling web crawlers exactly which URLs on your website you would like them to index. If your website supplies a sitemap.xml file, our crawler will follow it as it builds the search index for your website.

If you aren’t familiar with Sitemap.xml files, we’ll take you through a quick tutorial here, and there is additional information in our documentation section as well as the official protocol page.

To get started, create a simple sitemap.xml file. An example sitemap.xml that specifies 3 URLs might look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.com/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/faq/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/about/</loc>
  </url>
</urlset>
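
If you would rather generate the file than write it by hand, the following is a minimal sketch using only Python's standard library. The helper name `build_sitemap` is hypothetical and not part of any Swiftype tooling. It produces the same structure as the example above and lets the XML library handle escaping of characters such as `&` in URLs.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example sitemap above.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap.xml document (as a string) listing the given URLs.

    Hypothetical helper for illustration only.
    """
    # Setting "xmlns" as a plain attribute makes ElementTree write it
    # out as the default namespace declaration on <urlset>.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes special characters (e.g. "&") in text.
        ET.SubElement(url, "loc").text = address
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        "http://www.yourdomain.com/",
        "http://www.yourdomain.com/faq/",
        "http://www.yourdomain.com/about/",
    ]))
```

Save the returned string as sitemap.xml and upload it as described below.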

Next, put the sitemap.xml file on your web server at a location that our crawler can access. Many sites place the sitemap at the root of the domain (e.g. http://www.yourdomain.com/sitemap.xml), but any location is fine. Whichever location you choose, specify it in your robots.txt file as follows:

User-agent: *
Sitemap: http://www.yourdomain.com/sitemap.xml

If you’re unfamiliar with the robots.txt file, you can find more information at the official Web Robots page.
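
To check that the Sitemap line in your robots.txt is readable by a standard parser, here is a small sketch using Python's built-in `urllib.robotparser`. It assumes Python 3.8 or later (for `site_maps()`), and the function name `sitemaps_from_robots` is made up for this example. It shows how a generic parser reads the file, not how the Swiftype crawler is implemented.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the list of Sitemap URLs declared in a robots.txt string."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    example = (
        "User-agent: *\n"
        "Sitemap: http://www.yourdomain.com/sitemap.xml\n"
    )
    print(sitemaps_from_robots(example))
```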

Once your robots.txt file is updated and your sitemap.xml file has been uploaded, you’re finished. The next time the Swiftype crawler visits your website, we’ll recognize your sitemap.xml file and follow the links you specify.
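
If you want to confirm which URLs a crawler would find in your sitemap, you can list the `<loc>` entries yourself. The sketch below uses Python's standard library, and `urls_from_sitemap` is a hypothetical helper name. It only illustrates reading the file format and makes no claim about how the Swiftype crawler works internally.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map for the sitemap schema used in the example file.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_bytes):
    """Return every <loc> URL listed in a sitemap.xml document (bytes)."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text  # skip empty <loc/> elements
    ]


if __name__ == "__main__":
    sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.yourdomain.com/</loc></url>
  <url><loc>http://www.yourdomain.com/faq/</loc></url>
  <url><loc>http://www.yourdomain.com/about/</loc></url>
</urlset>"""
    for address in urls_from_sitemap(sample):
        print(address)
```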

As always, if you’re having trouble or want more information, feel free to get in touch. Also, don’t forget to follow the blog so you don’t miss out on great content from our friends like Bob Hiler from Mixergy.

Exclude Unwanted Content with Swiftype

Are there parts of your site you don’t want indexed? We’ve got you covered.

To exclude parts of your website by path, you can use Path Exclusions. You can exclude pages starting with, containing, or ending with the text you specify. For advanced users, we also support regular expression matches.

To add a path exclusion, click on a crawler-based engine, then select the Domains tab, then the domain to which you want to add path exclusions.

 

As you type your exclusion, we’ll show you a sample of the pages that will be removed from the index.

Once you’re happy with the exclusions, hit the Recrawl button to put them into effect.
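
To get a feel for how the four matching modes behave before you enter a rule, here is a rough Python sketch. Everything in it is an assumption made for illustration: the `PathRule` class, the rule kind names, and the choice of `re.search` for regular expressions are not Swiftype's actual implementation, and the dashboard preview remains the authoritative check.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PathRule:
    """Hypothetical exclusion rule: a match kind plus the text to match."""
    kind: str   # "begins_with", "contains", "ends_with", or "regex"
    value: str


def rule_matches(rule, path):
    """Return True if a single rule matches the given URL path."""
    if rule.kind == "begins_with":
        return path.startswith(rule.value)
    if rule.kind == "contains":
        return rule.value in path
    if rule.kind == "ends_with":
        return path.endswith(rule.value)
    if rule.kind == "regex":
        # Assumption: the pattern may match anywhere in the path.
        return re.search(rule.value, path) is not None
    raise ValueError(f"unknown rule kind: {rule.kind!r}")


def is_excluded(path, rules):
    """A path is excluded as soon as any rule matches it."""
    return any(rule_matches(rule, path) for rule in rules)


def preview_removed(paths, rules):
    """Mimic the preview step: list the paths the rules would remove."""
    return [p for p in paths if is_excluded(p, rules)]


if __name__ == "__main__":
    rules = [
        PathRule("begins_with", "/admin"),
        PathRule("ends_with", ".pdf"),
        PathRule("regex", r"^/blog/\d{4}/"),
    ]
    pages = ["/", "/admin/users", "/docs/guide.pdf", "/blog/2012/sitemaps", "/about/"]
    print(preview_removed(pages, rules))
```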

On an individual page, you can exclude content (for example, your header or footer) by adding the data-swiftype-index attribute set to false.

Here’s an example:

An example page with content exclusion

This is your page content, which will be indexed by the Swiftype crawler.

This content will be indexed, since it isn’t surrounded by an excluded tag.
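
To see the effect of the attribute on a concrete page, the sketch below uses Python's built-in `html.parser` to collect only the text that sits outside elements marked `data-swiftype-index="false"`. The sample markup is a hypothetical page assembled for this illustration: the header and footer `div` elements and their text are assumptions. The `IndexableText` class and the `indexable_text` function are invented names, and the real crawler may treat titles, scripts, or malformed markup differently.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical page; the header and footer divs are assumptions.
SAMPLE_PAGE = """<html>
<head><title>An example page with content exclusion</title></head>
<body>
  <div id="header" data-swiftype-index="false">Site navigation</div>
  <p>This is your page content, which will be indexed by the Swiftype crawler.</p>
  <div id="footer" data-swiftype-index="false">Copyright notice</div>
  <p>This content will be indexed, since it isn't surrounded by an excluded tag.</p>
</body>
</html>"""


class IndexableText(HTMLParser):
    """Collect text that is not inside an excluded element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        # One flag per open element: True if that element is excluded.
        self._open = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._open.append(dict(attrs).get("data-swiftype-index") == "false")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Keep text only when no enclosing element is excluded.
        if data.strip() and not any(self._open):
            self.chunks.append(data.strip())


def indexable_text(html):
    """Return the text chunks that would remain after content exclusion."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    for chunk in indexable_text(SAMPLE_PAGE):
        print(chunk)
```

The stack of flags assumes well-formed markup with matching open and close tags. It is kept simple so that content nested anywhere inside an excluded element is skipped too.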

 

By combining Path Exclusions and Content Exclusion, you can precisely control how your website is indexed by Swiftype.

As always, if you have trouble, please reach out.