February 2014 - Swiftype Blog | The Swiftype Blog

Here at Swiftype, one of the ways we work hard to improve our product is to use various analytics tools (including our own search analytics!) to watch how people are using our website, so we know what’s working for our customers and what isn’t. We use (and love!) Mixpanel, Google Analytics, and in-house tools built around various databases and log files.

When you’re doing these kinds of analytics, giving each user — or, especially, visitor — a unique ID is paramount. For logged-in users, this is easy; you just use their user ID. For visitors, you need to generate some kind of synthetic ID and use that.

This is easy enough to do: generate a UUID or sufficiently-large random number, hand it to the visitor in a cookie that never expires, and be done with it. wipes hands Done!

…well…almost.

There’s one problem — and it’s a big deal. Most web servers have, at best, a lot of difficulty logging outbound Set-Cookie headers; they typically only log inbound Cookie headers from cookies the client already has. This means you won’t get this generated ID in the log line for the very first request the user makes — and this is the very most important single HTTP request they will ever make, because it tells you how they found your site and what page they landed on. You can see what’s going on below; the cookie simply isn’t present in the request that your server is logging:

Fortunately, both Apache and nginx provide modules that solve this problem quite nicely. Apache’s mod_uid and nginx’s httpuseridmodule both can generate a unique token for each visitor, issue it to them in a cookie, and add it to your HTTP log file, even on the first request.

Let’s do this with nginx, by adding the following to our /etc/nginx/nginx.conf (by default, nginx compiles in support for http_userid_module already):

userid on;
userid_name brid;
userid_path /;
userid_expires max;

proxy_set_header X-Nginx-Browser-ID-Got $uid_got;
proxy_set_header X-Nginx-Browser-ID-Set $uid_set;

This tells nginx to generate a unique ID for all requests, to store it in a cookie named brid, to set its expiration to the maximum time allowed, and to pass it back to our Rails site in a header. There are actually two headers, because $uid_got will contain any inbound value for the ID ( i.e. , from a cookie the client already had) and $uid_set will contain an outbound value for the ID ( i.e. , that which nginx generated for this request).

Now, our situation looks like this:

We can read this value from Rails by looking at request.env['HTTP_X_NGINX_BROWSER_ID_SET'], which will contain a value like brid=D07FA8C019EA0753B600AD0F02030303. If we configure our nginx logs to output the contents of $uid_set, we’ll get it there, too. On the next request, we’ll get the exact same string in request.env['HTTP_X_NGINX_BROWSER_ID_GOT'], instead, because nginx is telling us it’s a UID passed by the client, rather than generated by nginx itself:

Past the first request, we’ll also get the value in cookies[:brid] — but in a different format. nginx sends us the ID as a hex string, but the cookie it sends to the client is Base64-encoded, so, while it represents the same value, it looks like wKh/0FMH6hkPrQC2AwMDAg== instead.

These are two incompatible formats, and neither of them is ideal if you want to store these values in your database — you may wish to save on precious buffer cache by using the most-compact possible format of pure binary data. Further, these IDs actually have internal structure; they’re generated using things like the IP address of the server, the start time of the web server process, the process ID, and a sequence token, which can be of use.

To help with this, we’re proud to offer WebServerUid, a small Ruby class that can:

Read any of these formats (hex, Base64, or binary);
Return any of these formats;
Compare and hash itself cleanly ( i.e. , <, >=, <=>, ==, and hash all work correctly);
Expose the internal structure of the UID (things like the IP address of the generating server, the PID of the generating process, and that process’s start time are in there);
Generate new UIDs from scratch, using the same algorithm as Apache and nginx.

Using WebServerUid, it takes about one minute to configure your web server to generate unique IDs for visitors and have them easily accessible in your Rails application. Our ApplicationController contains something like this:

def current_browser_id
  WebServerUid.from_header(request.env['HTTP_X_NGINX_BROWSER_ID_SET'], 'brid') ||
  WebServerUid.from_header(request.env['HTTP_X_NGINX_BROWSER_ID_GOT'], 'brid') ||
  WebServerUid.from_base64(cookies['brid'])
end

…and we can now store simply current_browser_id.to_binary_string in our database for a highly-efficient storage format; we can use this class on the way out to transform it to a more human-readable format.

Using these techniques, you can have unique IDs added throughout your Web stack in a matter of a half-hour or so. Enjoy!

If you enjoyed this post, subscribe to our blog newsletter for more tips such as this one or transparently storing MongoDB BSON IDs in a RDBMS.

Here at Swiftype, we use both MongoDB and MySQL to store some of our core metadata — not search indexes themselves, but users, accounts, search engines, and so on. As we’ve migrated data from MongoDB to MySQL, we’ve found ourselves needing to store the primary keys of MongoDB documents in MySQL.

While it’s possible to use more-or-less arbitrary data in MongoDB as your _id, very, very frequently you will simply use MongoDB’s built-in ObjectId type. This is a data type similar in concept to a UUID; it can be generated on any machine at any time, and the chance it will be globally-unique is still extremely high. Some relational databases offer native support for UUIDs; we thought, why shouldn’t we teach Rails how to get as close to that ideal as possible with ObjectIds, too?

The result has been our objectid_columns RubyGem, which we are proud to release as open source under the MIT license. Using ObjectIdColumns, you can store MongoDB ObjectId values as a CHAR(24) or VARCHAR(24) (which stores the hexadecimal representation of the ObjectId in your database), or as a BINARY(12), which stores an efficient-as-possible binary representation of the ObjectId value in your database.

No matter how you choose to store this data, it’s automatically exposed from your ActiveRecord models as an instance of the bson gem’s BSON::ObjectId class, or the moped gem’s Moped::BSON::ObjectId class. (ObjectIdColumns is compatible with both equally; the two are extremely similar.)

my_model = MyModel.find(...)
my_model.my_oid # => BSON::ObjectId('52eab2cf78161f1314000001')

You can assign values as an instance of either of these classes, or as a String representation of an ObjectId — in either hex or pure-binary forms — and it will automatically translate for you:

my_model.my_oid = BSON::ObjectId.new # OK
my_model.my_oid = "52eab32878161f1314000002" # OK
my_model.my_oid = "R\xEA\xB2\xCFx\x16\x1F\x13\x14\x00\x00\x01" # OK

ObjectIdColumns even transparently supports queries; the following will all “just work”:

MyModel.where(:my_oid => BSON::ObjectId('52eab2cf78161f1314000001'))
MyModel.where(:my_oid => '52eab2cf78161f1314000001')
MyModel.where(:my_oid => 'R\xEA\xB2\xCFx\x16\x1F\x13\x14\x00\x00\x01'))

Enjoy! Head on over to the objectid_columns GitHub page for more details, or just drop gem ‘objectid_columns’in your Gemfile and go for it!

If you enjoyed the tips in this tutorial, make sure to bookmark our blog and subscribe for more announcements like our new Swiftype Ruby Gem.

WebServerUid: Easy Unique Browser IDs for Rails & Better Analytics

ObjectIdColumns: Transparently Store MongoDB BSON IDs in a RDBMS

Subscribe to our blog