searchmysite.net update: Seeding and scaling

Note: this post is now on blog.searchmysite.net at https://blog.searchmysite.net/posts/searchmysite.net-update-seeding-and-scaling/.

Introduction

It has been just over 2 months since I launched https://searchmysite.net. I’ve had some good feedback from the IndieWeb community in that time, and made some key changes as a result, so thought it was time for an update. You may still want to refer to searchmysite.net: Building a simple search for non-commercial websites for the original overview.

The main changes can be summarised as seeding and scaling:

  1. “Pre-loading” the search index with several hundred personal websites, so the results set will be much richer from the outset.
  2. Allowing site submissions from non-verified site owners, so the search index can grow more quickly.
  3. Improving the relevancy tuning, so that it can continue to return good results from a much larger and potentially noisier search index.
  4. Redesigning the Search and Browse pages, to facilitate navigation of considerably more sites.
  5. Upgrading the servers, to allow it to cope with the larger indexing load and index size.

I’ve also replaced references to “the non-commercial web” and “independent websites” with references to “personal websites” to try to give the site a clearer focus. I still like the idea of listing small independent websites beyond just personal websites, e.g. the independent B&B sites I mention in London to Orkney, and most of the NC500, in an electric car, but I should get the personal websites listing working really well first, and of course see if there is any interest in extending it.

Seeding

Pre-loading the search index

The original idea behind listing only owner-verified sites was that it would increase the chances of a user finding something useful or interesting. The theory was that if someone cares enough about their site to go through the process of validating ownership, then the chances are that they also care about the quality of the content on their site, and the higher the percentage of quality content in the search index, the better the results are likely to be. The problem is that this approach suffers from a form of the cold start problem, i.e. while it might have resulted in a better search in the long term, it was a less useful search (with little content) in the short term. Much of the early feedback reinforced this. The average consumer wants something that is good now, not to help build something that may be better later.

To remedy this, I collected a list of personal websites and uploaded them. The main source was http://www.indiemap.org/, supplemented by some Hacker News (HN) posts1. For the record, 421 sites were loaded in this way (you can use this number to work out how many sites are user submitted, by deducting it from the total number of sites shown on the Browse page).

As a result, the search is now starting to return some potentially useful and interesting results for certain queries. Admittedly these are mostly queries for technology terms, which isn’t surprising given that most of the currently submitted personal websites belong to people who work in technology. Hopefully personal websites from a greater diversity of backgrounds will be submitted soon. Although, at the risk of stating the obvious, it is still a personal website search, and so is only likely to be useful when searching for things which might appear on personal websites, not a search for news or images or to answer questions like “What is my IP address?”

Allowing site submissions from non-verified site owners

Given that non-verified sites have been bulk-loaded, it seemed natural to allow users to submit unverified sites too, so I’ve added a new “Quick Add” option alongside the existing options (which have been renamed “Verified Add”).

The Verified Add gets more frequent indexing, more pages indexed, access to Manage Site to manage indexing (check status, add filters, trigger reindex, etc.), and access to the API. There’s a table with more details on the Quick Add page at https://searchmysite.net/admin/add/quick/.

The wording about what happens after a Quick Add is deliberately a little vague, i.e. “it needs to pass an approval process after submission before it is indexed”, and “The content will be checked to confirm it is a personal site, and if the site passes it will be queued for indexing”. What actually happens is that there is a new Review page where sites are manually approved before being indexed (or rejected and moved to a list of excluded domains), and they will need to be re-reviewed each year to remain.

I think the moderation is an important feature for maintaining a good level of quality content in the system, which was one of the key objectives of the original approach. One thing that became very apparent during the bulk loading process was the large number of personal sites which hadn’t been updated in years, or simply no longer worked at all, or had even expired and been taken over by domain squatters. Moderation might of course impact how the site can scale. There are a couple of possible approaches to this. Firstly, any verified site owner could be permissioned for the Review page, so there could be a community of moderators. And secondly, I quite like the idea of developing a classifier to automate the process once there is enough training data (although the feature engineering could be particularly challenging).
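To make that second idea a bit more concrete, here is a very rough sketch of the kind of classifier that could eventually help with the review step, assuming a set of already-moderated sites to train on. The example data, the choice of scikit-learn, and the use of homepage text as the only feature are all illustrative assumptions, not how it would necessarily be built:

    # A rough sketch, not the actual searchmysite code: train a simple text
    # classifier on homepage text from already-reviewed sites, then use it to
    # score new Quick Add submissions. Real feature engineering would be harder.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    homepage_text = [
        "Hi, I'm Jane. I write about hiking, photography and my allotment.",
        "Buy cheap watches online. Free shipping. Best prices guaranteed.",
    ]
    is_personal_site = [1, 0]  # labels taken from the manual review decisions

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(homepage_text, is_personal_site)
    print(model.predict(["My name is Sam and this is my personal blog."]))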

Scaling

Improving the relevancy tuning

Now that there is a lot more content being indexed, the relevancy tuning becomes much more important. Here’s the latest:

      <str name="qf">title^1.5 tags^1.3 description^1.3 url^1.3 author^1.1 body</str>
      <str name="pf">title^1.5 tags^1.3 description^1.3 url^1.3 author^1.1 body</str>
      <str name="bq">contains_adverts:false^18 owner_verified:true^1.8</str>
      <str name="bf">sum(1,log(indexed_inlinks_count))</str>

The biggest change was to start indexing what I call the “indexed inlinks”, i.e. the pages (from other domains within the search index, not from the same domain or from domains which aren’t indexed) which link to a page, and to add a boost function (bf) on the number of these. It isn’t quite the same thing as PageRank, because it doesn’t include all links and simply scores on the quantity of the inlinks rather than their quality, but it is a start. I’m using a log function on indexed_inlinks_count just so the results don’t get too skewed, but I may need to revisit this.
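To illustrate why the log is there, here is a quick sketch (assuming Solr’s log() function, which is base 10, and made-up inlink counts) showing how slowly the additive boost grows compared with the raw count:

    # sum(1,log(indexed_inlinks_count)) gives heavily-linked pages a nudge
    # without letting them completely dominate the results.
    import math

    for count in [1, 10, 100, 1000]:
        boost = 1 + math.log10(count)  # mirrors bf=sum(1,log(indexed_inlinks_count))
        print(f"{count:>5} indexed inlinks -> additive boost {boost:.1f}")
    # 1 -> 1.0, 10 -> 2.0, 100 -> 3.0, 1000 -> 4.0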

I also have a boost query (bq) for the pages which are owner verified. I don’t want to advertise this as one of the benefits of verifying ownership because it might (incorrectly in my view) sound dangerously close to the concept of “sponsored links”, but I do think it is important to try to keep to the original concept of encouraging content which is more likely to be of a higher quality.

There are a few other tweaks to some of the other numbers based on further testing. I removed is_home:true^2.5 because it was skewing results in ways which were more difficult to predict, given that bq is additive (in contrast to qf and pf, which are multiplicative). In many cases a site’s home page will be the most linked-to page, so the boost function on indexed_inlinks_count should have a similar effect.

Not really relevancy tuning as such, but I also added a minimum match (mm) with a value of 2, so searches for 2 words have an implicit AND rather than an OR (although not as a phrase), and searches for 3 or more words require the presence of at least 2. Judging by some of the actual queries users were making, this should get better results more of the time, although experience suggests it isn’t possible to provide a configuration that will satisfy all of the users all of the time.
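For anyone unfamiliar with the parameter, here is a quick sketch of how mm can be passed to the edismax query parser at query time (the Solr URL, core name and query below are placeholders for illustration, not the actual setup):

    # A minimal illustration of the edismax mm (minimum should match) parameter.
    # With mm=2, the three-word query below only needs 2 of its 3 terms to match.
    import requests

    params = {
        "q": "static site generator",
        "defType": "edismax",
        "mm": "2",
        "qf": "title^1.5 tags^1.3 description^1.3 url^1.3 author^1.1 body",
    }
    response = requests.get("http://localhost:8983/solr/content/select", params=params)
    print(response.json()["response"]["numFound"])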

Anyway, as ever with relevancy tuning, it is never really “finished”.

Redesigning the Search and Browse pages

Broadly speaking, there are two use cases for search - targeted searching, i.e. the user has something specific they are looking for, and serendipitous discovery, i.e. the user doesn’t really know what they’re looking for but still hopes to find something useful. On https://searchmysite.net the Search page is intended for the former, and the Browse page for the latter. In the original version, neither was suitable for searching more than a few dozen sites, so they needed redesigning.

For the main Search page, it quickly became apparent that some sites were “drowning out” other sites, i.e. a search for a certain term could lead to an entire page of results from one site, relegating other sites to later pages that users might not reach. The solution to this was to group results by site. As it is now, if there is more than one result from a site, up to 2 additional links from that site are shown, along with a link to all the results from that site. This is something that can be fine-tuned.
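As a rough illustration of the grouping (a sketch of the behaviour rather than the actual implementation, with assumed field names), a flat relevancy-ordered result list can be collapsed so that each site contributes at most the top hit plus 2 additional links:

    # Collapse a flat result list so each site shows at most 1 + extra_links
    # results, keeping a count of how many more are available per site.
    from collections import OrderedDict

    def group_by_site(results, extra_links=2):
        grouped = OrderedDict()
        for result in results:  # results assumed to be sorted by relevancy
            site = grouped.setdefault(result["domain"], {"hits": [], "total": 0})
            site["total"] += 1
            if len(site["hits"]) < 1 + extra_links:
                site["hits"].append(result)
        return grouped

    results = [
        {"domain": "example.org", "url": "https://example.org/post-1"},
        {"domain": "example.org", "url": "https://example.org/post-2"},
        {"domain": "another.example", "url": "https://another.example/about"},
        {"domain": "example.org", "url": "https://example.org/post-3"},
        {"domain": "example.org", "url": "https://example.org/post-4"},
    ]
    for domain, site in group_by_site(results).items():
        more = site["total"] - len(site["hits"])
        print(domain, [hit["url"] for hit in site["hits"]], f"plus {more} more" if more else "")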

For the main Browse page, simply listing all the sites wasn’t going to scale. The idea as currently implemented is to allow the user to drill into the sites according to various facets, or features, of the sites, to help them find some sites which may be of interest to them. Unfortunately it is questionable how much use many of these are at the moment. Tags in particular turn out to be particularly disappointing. I know all the Search Engine Optimisation “experts” say not to bother with tags because the major search engines don’t use them any more, but I didn’t think that would’ve stopped so many people from using them. Right now there aren’t really enough tags to form a tag cloud. Other possible future ways of browsing could also include some form of visualisation of the connections between sites, e.g. as per the social graph from http://www.indiemap.org/.
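For reference, this sort of drill-down can be powered by standard Solr faceting; here is a quick sketch of the kind of query involved (the core name and the tags field are assumptions for illustration, not necessarily the actual schema):

    # Request facet counts for a hypothetical "tags" field, without returning
    # any documents, to drive a tag-based drill-down on the Browse page.
    import requests

    params = {
        "q": "*:*",
        "rows": 0,             # only the facet counts are needed
        "facet": "true",
        "facet.field": "tags",
        "facet.mincount": 1,
    }
    response = requests.get("http://localhost:8983/solr/content/select", params=params)
    print(response.json()["facet_counts"]["facet_fields"]["tags"])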

In the light of the slightly disappointing Browse Sites page, I hit upon the idea of a “Newest Pages” link as an additional way for users to discover new sites.

I also finally got to implement the fairer Random Page. When there were only a handful of sites listed, one of the earliest site owners noticed that it was biased towards sites with more pages, because it simply picked a random page from all the pages. So I have implemented a new version which first picks a random domain from all the domains, and then picks a random page from that domain. For example, if there are 500 domains and 25,000 pages, everyone now has a 1 in 500 chance of their site coming up, whereas previously someone with a 10 page site would have had a 1 in 2,500 chance and someone with a 500 page site would have had a 1 in 50 chance.
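In code terms, the fairer selection looks something like the sketch below (the in-memory data structure is purely illustrative; in practice this would be a couple of queries against the index or database):

    # Two-step random pick: choose a domain uniformly at random, then a page
    # uniformly at random from within that domain, so small sites are not
    # disadvantaged by having fewer pages.
    import random

    pages_by_domain = {
        "smallsite.example": [f"https://smallsite.example/{p}" for p in ("about", "now")],
        "bigsite.example": [f"https://bigsite.example/post-{i}" for i in range(500)],
    }

    def fair_random_page(pages_by_domain):
        domain = random.choice(list(pages_by_domain))  # every domain equally likely
        return random.choice(pages_by_domain[domain])  # then any page within it

    print(fair_random_page(pages_by_domain))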

One final minor point on the redesign that might be worth mentioning is that it is now no longer a completely JavaScript-free site - there is now a small amount of JavaScript on the Browse page. It still works fine with JavaScript disabled though.

Upgrading the servers

Although the initial site was able to run fine on a t3.micro, I’ve had to upgrade to a t3.medium to cope with the significantly increased indexing load and index size. I’ve had to experiment with some of the other settings too. For example, I did originally allow verified sites to have up to 1,000 indexed pages, but unfortunately some of the sites with large numbers of microblog entries were taking 3-4 hours to index, and given that there could be several hundred or even several thousand sites being reindexed regularly, I was concerned about how it could scale without potentially expensive hardware upgrades.

Although I can probably afford to self-fund the server for the time being, I may need to bring forward plans for the “verified listing” fee, especially if load increases.

Conclusion

One of the questions at the end of the original write-up was whether users would be more interested in the search for personal websites, or in the search as a service (i.e. being able to place a search box on their own site and configure their indexing). Although I don’t have the definitive answer, and things may change, it is currently looking like there is more interest in the former. Certainly many personal websites, mine included, are fairly small and wouldn’t really benefit much from a search box. But there does seem to be a fair amount of interest in discovery mechanisms for federated content. We shall have to see how things go.

I guess the main question now is whether it is worth spending more of my free time on or not. To answer that question, I’ll ease off on enhancements, and try to focus on adoption for a while.