    • Comment author: Mark Hahn
    • Comment time: Jun 5th 2010
    Like many others I want to detect which visitors to my site are legit search bots, which are legit browsers, and which are bad bots. The XML file here is awesome and gives all the information needed for this task.

    Do the long strings in the XML file always show up as exactly the same strings in real traffic, or do they vary? If they vary, then I don't see a direct way to use the long strings to reliably separate the three classes of visitors I mentioned.

    I'm used to short strings like 'AltaVista', 'fangcrawl', etc. For example, it would be a lot easier to work with "UnChaos" rather than "<a href=''> UnChaos </a> From Chaos To Order Hybrid Web Search Engine.(".

    Does anyone provide a short-name version of these strings? If not, I was thinking of banging out a quick piece of software that takes the XML file as input, crunches it, and returns the minimum-size strings needed to differentiate between the three classes. If I do, I would be happy to provide this utility here.
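    To make the idea concrete, here is a rough sketch of the "crunching" step only. It assumes the strings have already been pulled out of the XML file and grouped by class; the function names, the class labels, and the sample strings in the usage note are all hypothetical and not taken from the actual file. For each string it looks for the shortest substring (case-insensitive, at least `min_len` characters) that does not occur in any string belonging to a different class.

```python
def shortest_distinguishing_token(target, others, min_len=3):
    """Return the shortest substring of `target` (lowercased) that occurs in
    none of the strings in `others`, or None if no such substring exists."""
    t = target.lower()
    lowered = [o.lower() for o in others]
    # Try short candidates first so the first hit is a shortest one.
    for length in range(min_len, len(t) + 1):
        for start in range(len(t) - length + 1):
            candidate = t[start:start + length]
            if candidate.strip() != candidate:
                continue  # skip candidates with leading/trailing whitespace
            if not any(candidate in o for o in lowered):
                return candidate
    return None


def short_names(classes, min_len=3):
    """`classes` maps a class label to a list of long strings.
    Returns a dict mapping each long string to its short token (or None)."""
    result = {}
    for label, strings in classes.items():
        # Every string that belongs to a different class than `label`.
        others = [s for other, group in classes.items()
                  if other != label for s in group]
        for s in strings:
            result[s] = shortest_distinguishing_token(s, others, min_len)
    return result
```

    Usage would look something like `short_names({"search_bot": [...], "browser": [...], "bad_bot": [...]})`. One caveat with this approach: the shortest token is only guaranteed to be unique within the list it was computed from, so a very short token could still match unrelated visitors in real traffic. Raising `min_len` trades compactness for safety.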

    Am I on a reasonable path here, or should I just use the lists of short words I find in Google searches instead?