Teaching Swiftbot to Intelligently Index Images

When creating a search engine, the first and arguably most important step is indexing website information in a structured format that is optimized for a specific search algorithm. The information you index and the structure by which you organize it (also known as the schema) dictate how your search engine will determine relevance, what your users can search by, and what information you can display in search results.

How does indexing work?
While there are numerous ways to customize and control the information you index in your Swiftype search engine (for example, via our API or one of our platform integrations), we aim to make this process as simple as possible for non-technical users by automatically indexing website information with Swiftbot, our high-performance web crawler designed to index information from a specific URL.

Swiftbot allows non-technical users to get up and running with a working search engine in minutes by simply entering their website URL and letting Swiftbot index the site for them. A major component of Swiftbot’s technology is the logic our engineering team has built to parse website HTML and index it in a structured format that works with Swiftype’s advanced search algorithm and information retrieval method. (To learn more about the technical challenge of building a search engine, read our white paper on the subject, written for a non-technical audience.)

Building an intelligent web crawler
Because almost every website is built and structured differently, teaching Swiftbot how to effectively read, sort, and organize information from a website’s HTML is an ongoing challenge. While we do allow site owners to completely customize the information Swiftbot indexes from their website with custom <meta> tags, not all users have the technical resources or knowledge to do this, so Swiftbot is also built to make many of these indexing decisions on its own.

With every website structured differently, how do we teach Swiftbot to intelligently index this information?

Still, with websites differing so dramatically from one another, indexing the right information in the right format from each page is no easy task. In particular, identifying the most important image on a web page and associating that image with a search result is a multifaceted problem, since most pages contain many images, and these images often have different filename structures and occupy different locations on the page.

Adding images to search results pages and autocomplete menus can create a much more engaging search experience.

Nevertheless, indexing images allows site owners to create a much more engaging search experience, adding thumbnails of varying sizes to their autocomplete and search results that let users preview the page content before selecting a result. So, in a recent update to Swiftbot, we’ve built in conditional logic that automatically indexes images from your website pages (provided there are no Swiftype-specific image tags already in place).

How does Swiftbot decide which image is “best”?
To teach Swiftbot how to index the “best” image from web pages, we had to build in logic that would overcome a series of challenges that result from the varying nature of website pages.

As a starting point, we decided to leverage the social <meta> tags (such as Facebook’s Open Graph tags and Twitter Card tags) that many site owners already use to prepare their content for sharing on social media platforms and other content distribution networks. By teaching Swiftbot to obey these <meta> tags if no Swiftype-specific <meta> tags exist, we created hierarchical indexing logic that more intelligently sources images from existing website metadata.
Secondly, we know that many websites have a large number of images that repeat across many, if not all, of their pages (for example: a company logo, images in the header, footer, and sidebar, author headshots, ads, etc.). To ensure these images are not considered the “best” image for a specific document, we built in logic that identifies these repeating elements and rules them out as candidates. Similarly, we do not want to index advertisements, so we check every image on the page against an ad server blacklist to keep ads out of consideration.
Thirdly, we compare the alt attribute of each <img> with the URL and <title> of the page, assigning each image a relevance score based on how closely its alt description matches this page information.
Lastly, Swiftbot looks for common CSS classes and IDs to locate the main content area of each page, another step that helps rule out extraneous information such as the header, footer, and sidebar.
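
The <meta> tag hierarchy from the first step can be sketched roughly as follows. This is an illustrative Python sketch, not Swiftbot's actual code: the function and class names are hypothetical, the precedence order (Swiftype tag, then Open Graph, then Twitter Card) is an assumption based on the description above, and the Swiftype tag shape shown in the comment is assumed for the example.

```python
from html.parser import HTMLParser

# Assumed precedence for illustration: a Swiftype-specific tag wins,
# then Open Graph (og:image), then Twitter Card (twitter:image).
PRECEDENCE = ["swiftype:image", "og:image", "twitter:image"]


class MetaImageParser(HTMLParser):
    """Collects image URLs declared in <meta> tags (first value per key)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        content = a.get("content")
        if not content:
            return
        # Assumed Swiftype-style tag: <meta class="swiftype" name="image" ...>
        if a.get("class") == "swiftype" and a.get("name") == "image":
            self.found.setdefault("swiftype:image", content)
        # Open Graph tags use property=, Twitter Card tags use name=.
        key = a.get("property") or a.get("name")
        if key in ("og:image", "twitter:image"):
            self.found.setdefault(key, content)


def image_from_meta(html):
    """Return the highest-precedence meta image URL, or None if absent."""
    parser = MetaImageParser()
    parser.feed(html)
    for key in PRECEDENCE:
        if key in parser.found:
            return parser.found[key]
    return None
```

When `image_from_meta` returns None, a crawler built this way would fall through to the page-level heuristics described in the remaining steps.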

Taking all these pieces of information together, Swiftbot assigns each image on the page a relevance score and indexes the image it judges to be the “best” image for that document. As this new indexing process gains wider use and we gather feedback from customers, we will continue to improve our image extraction technology.
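
To make the idea of a combined relevance score concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration rather than Swiftbot's real scoring: the names `alt_score` and `best_image`, the `AD_HOSTS` entry, the repeat threshold, the 0.5 bonus for images in the main content area, and the input shape (a page dict whose images carry a precomputed `in_main` flag).

```python
import re
from collections import Counter
from urllib.parse import urlparse

AD_HOSTS = {"ads.example.com"}  # hypothetical ad server blacklist


def tokens(text):
    """Lowercased alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def alt_score(alt, page_url, page_title):
    """Fraction of alt-text words that also appear in the URL or title."""
    alt_words = tokens(alt)
    if not alt_words:
        return 0.0
    page_words = tokens(page_url) | tokens(page_title)
    return len(alt_words & page_words) / len(alt_words)


def best_image(page, site_image_counts, repeat_threshold=3):
    """Pick the highest-scoring image src on a page, or None.

    page: {"url": str, "title": str, "images": [{"src", "alt", "in_main"}]}
    site_image_counts: maps an image src to the number of pages it appears on.
    """
    best, best_score = None, 0.0
    for img in page["images"]:
        src = img["src"]
        if urlparse(src).hostname in AD_HOSTS:
            continue  # skip images served from blacklisted ad hosts
        if site_image_counts.get(src, 0) >= repeat_threshold:
            continue  # skip logos and other site-wide repeats
        score = alt_score(img.get("alt", ""), page["url"], page["title"])
        if img.get("in_main"):
            score += 0.5  # assumed bonus for the main content area
        if best is None or score > best_score:
            best, best_score = src, score
    return best
```

In this sketch the repeat and blacklist checks act as hard filters, while alt-text overlap and content-area placement only adjust the score; a real crawler could just as reasonably treat all four as weighted signals.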

Adding these images to search
Once these images are indexed from your website and in your search engine, the question becomes: how do I display these image thumbnails in my search results and autocomplete dropdown? While there are many ways to style your autocomplete and search results (including using Swiftype’s web components or jQuery library), the best choice for users with very little technical experience is the Result Designer, which allows users to style their search results entirely from the Swiftype dashboard without writing any additional code. To learn more about the Result Designer, watch our dedicated webinar, which explains the tool and offers best-practice advice from the Swiftype customer success team.
