Getting Related

I have my lifestream up and running on Gregarius. I still plan on using Profilactic, though, since it will be more effective at updating the stream multiple times a day. I could put a cron job on the Gregarius installation, but it would have to run often because Twitter is in the stream.

What I do plan on using Gregarius for is serving related content for my sites through feeds of search results. It's a hack, but it will work until something better comes along. The question is how to get the keywords to create the query.

I have run a few tests, and the result is that there is more I need to learn. I have been using MySQL fulltext search to pull related data directly from a database. This was before the Gregarius installation. You see, the system is not only for related items from feeds but also for items from affiliate datafeeds.

The process goes like this: pull the tags from the post, run a fulltext search on the content in the database, and come back with about a 25% success rate on matching. I have no idea why this happens. It would seem that either I am using the wrong seed from the content or I am using the fulltext search incorrectly. Fulltext can also be slow if there is a lot of content to process.
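To make the tag-seeded search concrete, here is a minimal sketch of how the query might be built. The table name `items` and the columns `id`, `title` and `body` are hypothetical, and the sketch assumes a FULLTEXT index already exists on `(title, body)`. It only builds the SQL and the parameters; running it would need a MySQL driver that uses `%s` placeholders.

```python
def build_related_query(tags, limit=5):
    """Return (sql, params) for a MySQL natural-language fulltext search.

    `items`, `title` and `body` are hypothetical names used for illustration.
    """
    # Join the post's tags into one search string; MySQL tokenizes it itself.
    search = " ".join(t.strip() for t in tags if t.strip())
    sql = (
        "SELECT id, title, "
        "MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM items "
        "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY score DESC LIMIT %s"
    )
    # The search string is passed twice: once for the score, once for the filter.
    return sql, (search, search, limit)


sql, params = build_related_query(["adwords", " seo "], limit=3)
```

One thing that may be worth checking when the match rate is low, though it is only a guess at the cause: on MyISAM tables, natural language mode skips words shorter than the `ft_min_word_len` setting (4 by default) and ignores words that appear in half or more of the rows.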

I thought about the Yahoo API but it just serves up random tags and sometimes a lot of them. What if I only wanted two or three? How will I determine the top two or three words out of a whole post?

Or everything in the system could be tagged dynamically and then a comparison of tags run to find related items. The percentage of shared tags would indicate how likely two items are to be on a similar topic. The intersection of tags would create a web model of the knowledge in the database rather than the linear one of traditional search results. I seem to remember Scuttle doing something like this. Once the system got huge, though, the database joins kept using up system resources.
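The "percentage of similar tags" idea can be sketched as a set overlap score. This is one possible reading, not a settled design: it uses the shared tags divided by all distinct tags across both items (the Jaccard index), and the function names, the `catalog` structure and the 0.25 threshold are all assumptions made for illustration.

```python
def tag_overlap(tags_a, tags_b):
    """Share of tags two items have in common, from 0.0 to 1.0."""
    a = {t.lower() for t in tags_a}
    b = {t.lower() for t in tags_b}
    if not a or not b:
        return 0.0
    # Shared tags divided by all distinct tags across both items.
    return len(a & b) / len(a | b)


def related_items(target_tags, catalog, threshold=0.25):
    """Return (item_id, score) pairs at or above the threshold, best first.

    `catalog` is a hypothetical dict mapping item ids to lists of tags.
    """
    scored = [(tag_overlap(target_tags, tags), item_id)
              for item_id, tags in catalog.items()]
    return [(item_id, score)
            for score, item_id in sorted(scored, reverse=True)
            if score >= threshold]


catalog = {"a": ["MySQL", "search", "feeds"], "b": ["cooking"]}
matches = related_items(["php", "mysql", "search"], catalog)
```

Doing the comparison in memory like this sidesteps the joins, but it still compares the target against every item, so it would hit its own limits on a large database.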

Another way would be to use a structured set of topics, like Wikipedia, to create tags. Each item in the database could be grouped around Wikipedia pages.

An old module for PHPnuke found related topics by taking the title of the current post, removing the stop words and then picking a random word from what was left to query the database with. It worked pretty well, but I have to think of all the possibilities, and with the two-word titles that some of my posts have, I don't think this one is going to cut it.
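The title approach is simple enough to sketch from the description alone. This is not the PHPnuke module's code, just a rough Python version of the same steps, and the stop word list here is a tiny made-up sample rather than a real one.

```python
import random
import re

# A small illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on",
              "for", "with", "is", "how", "my"}


def title_keywords(title):
    """Lowercase the title, split it into words and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


def pick_query_word(title, rng=random):
    """Pick one remaining word at random, or None if nothing is left."""
    candidates = title_keywords(title)
    return rng.choice(candidates) if candidates else None


word = pick_query_word("How to Search the Feeds")
```

With a two-word title like "Getting Related" there are only two candidates, and neither says much about the topic, which is the weakness mentioned above.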

A lot of what I have seen from semantic search is not really that innovative. Most of the words pulled out of a post in this way are capitalized to begin with. Think of the objects it maps: People, Places, Companies, etc. Is there anything here not in title case? I knew that was a trigger a long time ago, but I also knew it only covered a few things. If I look back through this post so far, I see Gregarius, Profilactic, Yahoo API, Scuttle and Twitter. Pulling those words out of this text may pull up some related items, but this is still limited.
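The title case trigger can be sketched with a regular expression and a naive sentence split. Everything here is an assumption made for illustration: the function name, the rule of skipping the first word of each sentence (since it is capitalized anyway) and the small ignore list.

```python
import re


def capitalized_terms(text, ignore=frozenset({"I"})):
    """Collect runs of capitalized words that do not start a sentence."""
    terms = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        run = []
        for word in words[1:]:  # skip the sentence-initial word
            if word[0].isupper() and word not in ignore:
                run.append(word)
            else:
                if run:
                    terms.append(" ".join(run))
                run = []
        if run:
            terms.append(" ".join(run))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


sample = ("I thought about the Yahoo API but it serves random tags. "
          "It seems that I remember Scuttle doing this. "
          "Twitter is in the stream.")
terms = capitalized_terms(sample)
```

On the sample text this finds "Yahoo API" and "Scuttle" but misses "Twitter" because it opens a sentence, which shows how limited the trigger is on its own.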

One way to get rid of some of the noise is using stop words. These are words that search engines and indexing algorithms don't index. They are common words like a, an and the. I thought maybe I could take this one to the extreme.

If the most common words are so common that they can't really indicate anything, then the most uncommon words must be real indicators. So if I check the concentration of keywords in a body of text, the words that occur least frequently in the text may be heavy indicators. Or I could have a table of the average word occurrences for the English language as a whole to compare against.
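Here is a rough sketch of the comparison-table version of this idea. It scores each word by how much more often it shows up in the text than a baseline says it should. The `baseline` dict stands in for the table of average English word frequencies; the numbers in the example are invented for illustration, and the `unseen` default for words missing from the table is also an assumption.

```python
import re
from collections import Counter


def distinctive_words(text, baseline, top_n=3, unseen=1e-6):
    """Rank words by (frequency in text) / (baseline frequency).

    `baseline` maps a word to its relative frequency in general English.
    Words missing from it get the tiny `unseen` frequency, so they score high.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return []
    scores = {w: (c / total) / baseline.get(w, unseen)
              for w, c in counts.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Made-up baseline frequencies, not measured values.
baseline = {"the": 0.05, "is": 0.01, "a": 0.02,
            "search": 0.0001, "slow": 0.0002}
text = "the fulltext search is slow the search is a fulltext search"
top = distinctive_words(text, baseline)
```

Words the table has never seen float straight to the top, which is good for jargon but would do the same for typos, so some filtering would probably be needed.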

As you get closer to the core of a topic, more jargon and technical speak is used to describe it. It's the more general posts that will be hard to peg. But as the least frequent words start bubbling to the top, I would think that there would be a point where the topic match would be locked.

I thought about spam and then I realized it would be a closed loop. Nothing gets in the system that's not let in. But tracking could help. Clicks and traffic could indicate a lock in the matching, and this would help make the system more dynamic, adjusting to environmental factors. It would also help to choose the best matches when a lot of possible matches are available.

It would also seem that as the system absorbed more content, there would be a possibility that it could perfect itself. If there were 10 articles on Adwords in the database, the keyword pool would be smaller than if there were 100 articles on Adwords. The matches could be fine-tuned by similar keyword occurrences, and the 100 articles on Adwords could be broken down into subnetworks by keyword occurrence. I used subnetworks instead of subcategories because I am still looking at a web rather than a line.
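One way to picture the subnetworks is to link any two articles that share enough keywords and then collect the connected groups. This is only a sketch of that reading: the function name, the `articles` structure and the `min_shared` cutoff are all hypothetical.

```python
from itertools import combinations


def keyword_subnetworks(articles, min_shared=2):
    """Group article ids into clusters linked by shared keywords.

    `articles` is a hypothetical dict mapping article ids to keyword lists.
    """
    ids = list(articles)
    neighbors = {a: set() for a in ids}
    # Link every pair of articles that shares at least `min_shared` keywords.
    for a, b in combinations(ids, 2):
        if len(set(articles[a]) & set(articles[b])) >= min_shared:
            neighbors[a].add(b)
            neighbors[b].add(a)
    groups, seen = [], set()
    # Walk the links from each unvisited article to collect its cluster.
    for start in ids:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


articles = {
    "a1": ["adwords", "bidding", "cpc"],
    "a2": ["adwords", "cpc", "budget"],
    "a3": ["adwords", "landing", "copy"],
    "a4": ["landing", "copy", "headline"],
}
groups = keyword_subnetworks(articles)
```

In this example all four articles mention Adwords, but they split into a bidding group (a1, a2) and a landing page group (a3, a4). Comparing every pair grows quickly with the number of articles, so this would need a smarter index on a big database.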

And there you have it. Again, a list of possibilities with no answers. It's what I call thinking before the box. Get everything out that I can think of on a subject, then hit the books. It sets my goal in my mind before I dive into the math. It also gives me a chance to have beginner's luck, that magical state of mind before you become an expert and then have to follow all the rules. I have already found a few PDFs and books on the subject and if you thought this post was boring. Why am I taking this to the extreme? For one, if I had no chance to learn something new, I wouldn't be doing it in the first place. Two, I also like studying the mind, and as I delve deeper into this subject, I am learning much more than code.

So, when I was finished with this, I decided it would be good to have examples of what I am talking about. The two blocks in this post were created using the SimplePie WordPress plugin. One uses a feed of a tweetscan search. The other uses a Dapp feed of a complex del.icio.us search. Both are cached for 30 minutes, so they will evolve and show new links every 30 minutes. This is close to what I am looking to do, but with a full lifestream instead of separate feeds.


Written by Stephan Miller, Kansas City Software Engineer and Author