Frequency weighted sampling in SQL
Let's say we have a table with some large number of rows and we want a randm sample of them. The easiest approach to this is to use something like
NEWID() as follows:
SELECT TOP 100 * FROM t1 ORDER BY NEWID()
This works greats if all rows are considered equal. Let's say, however, that we have a column that gives each row a different weight or importance. The earlier approach is indiscriminate, will ignore those weights, and result ina much more tail-skewed sample.
Instead, we can use something like this:
SELECT Query, QueryCount FROM ( SELECT TOP 2000 * FROM t1 ORDER BY RAND(CAST(NEWID() AS VARBINARY)) * 1/QueryCount ) x ORDER BY QueryCount DESC
We change the
RAND() seed for each row otherwise it'll give the same value for every row. The inverse frequency will tend to put higher weighted rows closer to the top, subject to the randomization.