Metasearch engine

Figure: Architecture of a metasearch engine

A metasearch engine (or aggregator) is a search tool that uses other search engines' data to produce its own results from the Internet.[1][2] Metasearch engines take input from a user and simultaneously send out queries to third-party search engines. Once sufficient data has been gathered, the results are ranked and presented to the user.

Information stored on the World Wide Web is constantly expanding, making it increasingly difficult for a single search engine to index the entire web. A metasearch engine is one way to overcome this limitation. By combining results from different search engines, a metasearch engine enhances the user’s experience of retrieving information, as less effort is required to access more material. A metasearch engine is efficient in that it can generate a large volume of data. However, each search engine scores websites differently, which can draw in irrelevant documents. Other problems, such as spamming, also significantly reduce the accuracy of the search.[3] The process of fusion aims to tackle this issue and improve the engineering of a metasearch engine.[4] There are many types of metasearch engines that give users access to specialised information in a particular field. These include SavvySearch and MetaSEEk.

History

"Why search the web with one search engine when you can search them all - or at least several?" This was the question tackled by researchers following a search engine review that found different search engines to be producing different results because of the different algorithms on which each was based.

The first person to incorporate the idea of metasearching was Colorado State University's Daniel Dreilinger. He released SavvySearch, which let users search up to 20 different search engines and directories at once. Although fast, the search engine was restricted to simple searches and thus was not very reliable. University of Washington student Erik Selberg released a more "updated" version called MetaCrawler. This search engine improved on SavvySearch's accuracy by adding its own search syntax behind the scenes and matching that syntax to the syntax of the search engines it was probing. MetaCrawler reduced the number of search engines queried to six, and although it produced more accurate results, it still was not considered as accurate as running the query in an individual engine.[5]

Another metasearch engine was launched on May 20, 1996. HotBot, owned by Wired at the time, was a search engine whose results came from the Inktomi and Direct Hit databases. It was known for its fast results and funky name, and for its ability to search within search results. After it was bought by Lycos in 1998, development of the search engine stagnated and its market share fell drastically. After going through a few alterations, HotBot was redesigned into a simplified search interface, with its features being incorporated into Lycos' website redesign.[6]

Ixquick is a search engine more recently known for its privacy policy. Developed and launched in 1998 by David Bodnick, it has been owned by Surfboard Holding BV since 2000. In June 2006, Ixquick began to delete private details of its users, following the same practice as Scroogle. Ixquick's privacy policy includes no recording of users' IP addresses, no identifying cookies, no collection of personal data, and no sharing of personal data with third parties.[7] It also uses a distinctive ranking system in which each result is rated with stars: the more stars a result has, the more search engines agreed on it.
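
The star idea can be illustrated with a minimal sketch. It assumes one star per engine that returned a URL, which is a reading of the description above and not Ixquick's actual implementation; the engine names and URLs are hypothetical.

```python
from collections import Counter

def star_ratings(results_by_engine):
    """Count, for each URL, how many engines returned it (one star per engine)."""
    stars = Counter()
    for urls in results_by_engine.values():
        # A set ensures an engine contributes at most one star per URL.
        stars.update(set(urls))
    return stars

# Hypothetical result lists from three engines.
results_by_engine = {
    "engine_a": ["http://x.example", "http://y.example"],
    "engine_b": ["http://y.example", "http://z.example"],
    "engine_c": ["http://y.example", "http://x.example"],
}
stars = star_ratings(results_by_engine)
# Most stars first; ties are broken alphabetically to keep the order stable.
ranked = sorted(stars, key=lambda url: (-stars[url], url))
```

Here `http://y.example` earns three stars because all three engines returned it, so it is listed first.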

In April 2005, Dogpile (owned and operated by InfoSpace, Inc. at the time) collaborated with researchers from the University of Pittsburgh and Pennsylvania State University to measure the overlap and ranking differences of leading Web search engines in order to gauge the benefits of using a metasearch engine to search the web. The study found that, across 10,316 random user-defined queries run on Google, Yahoo!, and Ask Jeeves, only 3.2 percent of first-page search results were the same across those search engines for a given query. Another study later that year, using 12,570 random user-defined queries run on Google, Yahoo!, MSN Search, and Ask Jeeves, found that only 1.1 percent of first-page search results were the same across those search engines for a given query.
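
An overlap figure of this kind could be computed along the following lines. The source does not give the study's exact formula, so this sketch assumes that overlap means the share of distinct first-page URLs returned by every engine for one query; the function name and sample data are hypothetical.

```python
def shared_fraction(first_pages):
    """Fraction of distinct first-page URLs that every engine returned."""
    pages = [set(urls) for urls in first_pages.values()]
    all_urls = set().union(*pages)
    if not all_urls:
        return 0.0
    shared = set.intersection(*pages)
    return len(shared) / len(all_urls)

# Hypothetical first-page results for a single query.
first_pages = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u3", "u4"],
    "engine_c": ["u3", "u5", "u6"],
}
overlap = shared_fraction(first_pages)  # one shared URL out of six distinct URLs
```

Averaging this value over many queries would give a single percentage comparable in spirit to the figures quoted above.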

Advantages

By sending queries to several other search engines, a metasearch engine extends the search coverage of a topic and allows more information to be found. Metasearch engines use the indexes built by other search engines, aggregating and often post-processing results in unique ways. A metasearch engine has an advantage over a single search engine because more results can be retrieved with the same amount of effort.[2] It also saves users from having to type the same search into different engines individually to look for resources.[2]

Metasearching is also a useful approach if the purpose of the user’s search is to get an overview of the topic or to get quick answers. Instead of having to go through multiple search engines like Yahoo! or Google and compare results, users can rely on metasearch engines to quickly compile and combine results. They can do so either by listing results from each engine queried with no additional post-processing (Dogpile) or by analyzing the results and ranking them by their own rules (Ixquick, MetaCrawler, and Vivisimo).[8]

Disadvantages

Metasearch engines cannot decode query forms or fully translate query syntax. The number of links generated by a metasearch engine is limited, so the user does not receive the complete results of a query.[9] The majority of metasearch engines do not provide more than ten linked files from a single search engine, and generally do not interact with larger search engines for results. Sponsored webpages are prioritised and are normally displayed first.[10]

Metasearching also gives the illusion that there is more coverage of the topic queried, particularly if the user is searching for popular or commonplace information. It is common to end up with multiple identical results from the queried engines. It is also harder for users to send advanced search syntax with the query, so results may not be as precise as when a user works with an advanced search interface at a specific engine. As a result, many metasearch engines use only simple searching.[11]

Operation

A metasearch engine accepts a single search request from the user. This request is then passed on to the databases of other search engines. A metasearch engine does not create its own database of webpages but generates a virtual database to integrate data from multiple sources.[12][13][14]

Since every search engine is unique and uses different algorithms to generate ranked data, the combined results will contain duplicates. To remove duplicates, a metasearch engine processes this data and applies its own algorithm. A revised list is produced as output for the user.[15][16] When a metasearch engine contacts other search engines, they respond in one of three ways:

  • A search engine can cooperate and provide complete access to its interface for the metasearch engine, including private access to the index database, and inform the metasearch engine of any changes made to the index database;
  • A search engine can behave in a non-cooperative manner, neither denying nor providing access to its interfaces;
  • A search engine can be completely hostile and refuse the metasearch engine any access to its database, in serious circumstances by seeking legal remedies.[17]
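
The basic operation described in this section (one query sent to several engines, followed by removal of duplicates) can be sketched as follows. The two engine functions are hypothetical stand-ins that return fixed lists; a real metasearch engine would call each engine's interface instead. The URL normalisation rule (lower-case host, trailing slash ignored) is also an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

# Hypothetical engines: each takes a query and returns a ranked list of URLs.
def engine_a(query):
    return ["http://Example.com/page/", "http://example.org/a"]

def engine_b(query):
    return ["http://example.com/page", "http://example.net/b"]

def normalize(url):
    """Key used to detect duplicates: lower-case host plus path without a trailing slash."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path.rstrip("/"))

def metasearch(query, engines):
    """Query all engines in parallel and merge their results, dropping duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map returns the result lists in the same order as `engines`.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

merged = metasearch("metasearch", [engine_a, engine_b])
```

The two spellings of the `example.com` page collapse into a single entry, so the merged list holds three URLs. This sketch only removes duplicates; it does not re-rank the merged list.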

Architecture of ranking

Webpages that are highly ranked on many search engines are likely to be more relevant in providing useful information.[17] However, each search engine assigns its own ranking score to a website, and most of the time these scores are not the same. This is because search engines prioritise different criteria and methods for scoring, so a website might rank highly on one search engine and poorly on another. This is a problem because metasearch engines rely heavily on the consistency of this data to generate reliable results.[17]
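
One common way to sidestep incomparable scores is to merge on rank positions alone. The sketch below uses a Borda count, a technique not named in the source and chosen here only for illustration; the document identifiers are hypothetical.

```python
from collections import defaultdict

def borda_merge(rankings):
    """Merge ranked lists with a Borda count: higher positions earn more points."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, doc in enumerate(ranking):
            # The first item in a list of n earns n points, the last earns 1.
            points[doc] += n - position
    # Most points first; ties are broken alphabetically.
    return sorted(points, key=lambda doc: (-points[doc], doc))

# Hypothetical rankings of the same query from three engines.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
merged = borda_merge(rankings)
```

Document `b` comes out on top because it is ranked first or second by every engine, which matches the intuition that pages ranked highly on many engines are more likely to be relevant.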

Fusion

Figure: Data fusion model

A metasearch engine uses the process of Fusion to filter data for more efficient results. The two main fusion methods used are: Collection Fusion and Data Fusion.

  • Collection Fusion: also known as distributed retrieval, deals specifically with search engines that index unrelated data. To determine how valuable these sources are, collection fusion looks at their content and ranks each source by how likely it is to provide relevant information for the query. From this ranking, collection fusion picks out the best resources, which are then merged into a list.[17]
  • Data Fusion: deals with information retrieved from search engines that index common data sets. The process is very similar. The initial rank scores of the data are merged into a single list, after which the original ranks of each document are analysed. Data with high scores indicate a high level of relevance to a particular query and are therefore selected. To produce a list, the scores must be normalized and combined using algorithms such as CombSum. This is necessary because search engines use different algorithms, so the scores they produce are not directly comparable.[18][19]
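
As a minimal sketch of data fusion, the code below rescales each engine's scores and then applies CombSum, which adds a document's scores across engines. Min-max normalisation is an assumption made here; the source does not say which normalisation is used. The engine score tables are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one engine's scores to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are equal, so treat every document as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}

def comb_sum(score_lists):
    """CombSum: add each document's normalized scores across engines."""
    fused = {}
    for scores in score_lists:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    # Highest fused score first; ties are broken by document name.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical raw scores on two different scales.
engine_a = {"doc1": 12.0, "doc2": 8.0, "doc3": 4.0}
engine_b = {"doc2": 0.9, "doc3": 0.5, "doc4": 0.1}
fused = comb_sum([engine_a, engine_b])
```

In this example `doc2` ranks first: it is the only document that both engines score well, even though `doc1` has the highest raw score on a single engine.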

Spamdexing

Spamdexing is the deliberate manipulation of search engine indexes. It uses a number of methods to manipulate the relevance or prominence of indexed resources in a manner unaligned with the intention of the indexing system. Spamdexing can be very distressing for users and problematic for search engines because the results returned by searches have poor precision.[20] Eventually the search engine becomes unreliable for the user. To tackle spamdexing, search robot algorithms are made more complex and are changed almost every day.[21]

It is a major problem for metasearch engines because it tampers with the search robot’s indexing criteria, which are heavily relied upon to format ranking lists. Spamdexing manipulates the natural ranking system of a search engine, and places websites higher on the ranking list than they would naturally be placed.[22] There are three primary methods used to achieve this:

Content Spam

Content spam refers to techniques that alter the logical view that a search engine has of a page's contents. Techniques include:

  • Keyword Stuffing - Calculated placements of keywords within a page to raise the keyword count, variety, and density of the page
  • Hidden/Invisible Text - Unrelated text disguised by making it the same color as the background, using a tiny font size, or hiding it within the HTML code
  • Meta-tag Stuffing - Repeating keywords in meta tags and/or using keywords unrelated to the site's content
  • Doorway Pages - Low-quality webpages with little content but related keywords or phrases
  • Scraper Sites - Websites whose content is copied from other websites by automated programs
  • Article Spinning - Rewriting existing articles as opposed to copying content from other sites
  • Machine Translation - Using machine translation to rewrite content in several different languages, resulting in unintelligible text
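
Keyword stuffing, the first technique in the list above, can be illustrated with a simple density check. This is a hedged sketch only: the function names and the 0.3 threshold are hypothetical, and real spam detection relies on far more signals than word frequency.

```python
import re
from collections import Counter

def keyword_density(text, top_n=3):
    """Return the most frequent words with their share of all words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top_n)]

def looks_stuffed(text, threshold=0.3):
    """Flag text when a single word exceeds `threshold` of all words (hypothetical cut-off)."""
    density = keyword_density(text, top_n=1)
    return bool(density) and density[0][1] > threshold

stuffed_page = "cheap flights cheap flights book cheap flights cheap cheap deals"
normal_page = "A metasearch engine sends one query to several search engines"
```

In `stuffed_page` the word "cheap" makes up half of all words, so the check flags it, while `normal_page` passes.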

Link Spam

Link spam consists of links between pages that are present for reasons other than merit. Techniques include:

  • Link-building Software - Automating the search engine optimization (SEO) process
  • Link Farms - Pages that reference each other (also known as mutual admiration societies)
  • Hidden Links - Placing hyperlinks where visitors won't or can't see them
  • Sybil Attack - Forging of multiple identities for malicious intent
  • Spam Blogs - Blogs created solely for commercial promotion and the passage of link authority to target sites
  • Page Hijacking - Creating a copy of a popular website with similar content that redirects web surfers to unrelated or even malicious websites
  • Buying Expired Domains - Buying expiring domains and replacing pages with links to unrelated websites
  • Cookie Stuffing - Placing an affiliate tracking cookie on a website visitor's computer without their knowledge
  • Forum Spam - Inserting links to spam sites on websites that can be edited by users
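
Link farms, described above as pages that reference each other, leave a visible trace in the link graph. The sketch below finds reciprocal link pairs in a small hypothetical graph. Reciprocal links also occur between legitimate sites, so this is at most one weak signal and not a detection method taken from the source.

```python
def reciprocal_pairs(links):
    """Return the set of page pairs that link to each other."""
    pairs = set()
    for source, targets in links.items():
        for target in targets:
            if source != target and source in links.get(target, ()):
                # Sort the pair so (a, b) and (b, a) are stored once.
                pairs.add(tuple(sorted((source, target))))
    return pairs

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "farm1.example": {"farm2.example", "farm3.example"},
    "farm2.example": {"farm1.example", "farm3.example"},
    "farm3.example": {"farm1.example", "farm2.example"},
    "news.example": {"farm1.example"},
}
pairs = reciprocal_pairs(links)
```

The three farm pages form three reciprocal pairs, while the one-way link from `news.example` is not counted.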

Cloaking

Cloaking is an SEO technique in which different materials and information are sent to the web crawler and to the web browser.[23] It is commonly used as a spamdexing technique because it can trick search engines into giving a site a higher ranking, or lead users to a site that is substantially different from its search engine description.
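
A simple way to reason about cloaking is to compare what a crawler received with what a browser received. The sketch below assumes both versions of the page text have already been fetched, compares them word by word, and flags a large difference. The function name and the 0.5 similarity threshold are hypothetical.

```python
from difflib import SequenceMatcher

def cloaking_suspected(crawler_view, browser_view, min_similarity=0.5):
    """Flag a page when the crawler and browser versions share too few words in sequence."""
    similarity = SequenceMatcher(None, crawler_view.split(), browser_view.split()).ratio()
    return similarity < min_similarity

# Hypothetical page text as served to a crawler and to a browser.
crawler_view = "Guide to metasearch engines and how they rank results"
browser_view = "Buy discount watches now limited offer"
```

Here the two versions share no words, so `cloaking_suspected(crawler_view, browser_view)` is true, while comparing a page with itself is not flagged. Legitimate pages may also differ between visits (for example through personalised content), so a low similarity is only a hint.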

SavvySearch Engine

SavvySearch is a metasearch engine that is capable of analysing other search engines and making independent decisions on whether or not to interact with them. It is selective in order to provide information that is useful for users. It was created to address two goals: to increase the chance of returning reliable links, and to reduce the consumption of web resources. SavvySearch can find out how reliable a particular source is by sending parallel queries to search engines. It is also capable of monitoring an engine’s performance over the long run to determine how well that search engine responds to a particular query.[24]
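
The idea of learning which engines to query can be sketched with a small score table. This is loosely inspired by the description above and is not SavvySearch's actual algorithm; the class name, the +1/-1 update rule, and the engine names are all assumptions.

```python
from collections import defaultdict

class EngineSelector:
    """Tracks how useful each engine has been for each query term."""

    def __init__(self):
        self.scores = defaultdict(float)  # (term, engine) -> accumulated score

    def record(self, query, engine, useful):
        """Reward or penalise an engine for every term in the query."""
        delta = 1.0 if useful else -1.0
        for term in query.lower().split():
            self.scores[(term, engine)] += delta

    def choose(self, query, engines, k=2):
        """Pick the k engines with the best accumulated score for the query's terms."""
        terms = query.lower().split()

        def total(engine):
            return sum(self.scores.get((term, engine), 0.0) for term in terms)

        # Best score first; ties are broken by engine name.
        return sorted(engines, key=lambda e: (-total(e), e))[:k]

selector = EngineSelector()
selector.record("python tutorial", "engine_a", useful=True)
selector.record("python tutorial", "engine_b", useful=False)
selector.record("python docs", "engine_c", useful=True)
chosen = selector.choose("python guide", ["engine_a", "engine_b", "engine_c"])
```

Querying only the chosen engines serves both goals stated above: engines with a poor record for the query's terms are skipped, which saves web resources.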

MetaSEEk Engine

MetaSEEk is a metasearch engine that deals with image retrieval. It selects and analyses various image search engines. The selected engines are monitored and ranked on their ability to provide images relevant to the user's query, and a score is applied accordingly. The scores are indexed so that they can be referred to for new search queries. MetaSEEk also takes care to ensure that images are available in its database by storing visual features for every image. The performance scores of past queries are used as a reference when selecting search engines for new queries. When a query is made, MetaSEEk goes through its files to match the corresponding image with the most suitable search engines. If an image is not available in the database, a similar image is selected instead. This is achieved by analysing feature vectors for common characteristics to find the closest possible match to the original. When results are collected from the search engines, they are merged and presented to the user.[25]
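
Finding the closest stored image by comparing feature vectors can be sketched as a nearest-neighbour lookup. The three-number vectors, the file names, and the use of Euclidean distance are assumptions for illustration; the source does not specify which visual features or distance measure MetaSEEk uses.

```python
import math

def nearest_image(query_features, database):
    """Return the name of the stored image whose feature vector is closest to the query."""
    return min(database, key=lambda name: math.dist(query_features, database[name]))

# Hypothetical database: image name -> visual feature vector.
database = {
    "sunset.jpg": (0.9, 0.4, 0.1),
    "forest.jpg": (0.1, 0.8, 0.2),
    "ocean.jpg": (0.1, 0.3, 0.9),
}
match = nearest_image((0.8, 0.5, 0.2), database)
```

The query vector lies closest to `sunset.jpg`, so that image would stand in for a query image that is not itself in the database.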

References

  1. Sandy Berger's Great Age Guide to the Internet by Sandy Berger. Que Publishing, 2005, ISBN 0-7897-3442-7.
  2. [citation missing]
  3. Lawrence, S. and Lee Giles, C. (2014). Patent US6999959 - Meta search engine. [online] Google Books. Available at: http://www.google.com/patents/US6999959 [Accessed 20 Oct. 2014].
  4. The collection fusion problem. By E. M. Voorhees, Narendra K. Gupta, and Ben Johnson-Laird. NIST SPECIAL PUBLICATION SP, 1995, 95-95.
  5. [citation missing]
  6. [citation missing]
  7. [citation missing]
  8. [citation missing]
  9. [citation missing]
  10. http://www.au.af.mil/au/, (2014). 42. [online] Available at: http://Intelligence Exploitation of the internet [Accessed 31 Oct. 2014].
  11. http://www.timeatlas.com/term_to_learn/general/metasearch_engines_expands_your_horizon#.VH1BGYvF_ng
  12. MENG, W. (2014). Metasearch Engines. [online] http://www.cs.binghamton.edu. Available at: http://www.cs.binghamton.edu/~meng/pub.d/EDBS_Metasearch.pdf [Accessed 20 Oct. 2014].
  13. [citation missing]
  14. [citation missing]
  15. Patel, B. and Shah, D. (2014). RANKING ALGORITHM FOR META SEARCH ENGINE. [online] http://www.technicaljournalsonline.com. Available at: http://www.technicaljournalsonline.com/ijaers/VOL%20II/IJAERS%20VOL%20II%20ISSUE%20I%20%20OCTBER%20DECEMBER%202012/231.pdf [Accessed 21 Oct. 2014].
  16. [citation missing]
  17. M, M. and Jacob, E. (2014). Information retrieval on Internet using meta-search engines: A review. [online] http://nopr.niscair.res.in. Available at: http://nopr.niscair.res.in/bitstream/123456789/2243/1/JSIR%2067(10)%20739-746.pdf [Accessed 27 Oct. 2014].
  18. [citation missing]
  19. Manmatha, R. and Sever, H. (2014). A Formal Approach to Score Normalization for Meta-search. [online] http://maroo.cs.umass.edu. Available at: http://maroo.cs.umass.edu/pdf/IR-242.pdf [Accessed 27 Oct. 2014].
  20. Al-Kabi, M., Wahsheh, H. and Alsmadi, I. (2014). An Online Arabic Web Spam Detection System. [online] http://thesai.org. Available at: http://thesai.org/Downloads/Volume5No2/Paper_16-OLAWSDS_An_Online_Arabic_Web_Spam_Detection_System.pdf [Accessed 24 Oct. 2014].
  21. Najork, M. (2014). Web Spam Detection. [online] http://research.microsoft.com. Available at: http://research.microsoft.com/pubs/102938/eds-webspamdetection.pdf [Accessed 23 Oct. 2014].
  22. Vandendriessche, G., Altius, and Brussels, (2014). A few legal comments on spamdexing. [online] http://www.altius.com. Available at: http://www.altius.com/media/pdf_files/spamdexing.pdf [Accessed 22 Oct. 2014].
  23. Wang, Y., Ma, M., Niu, Y. and Chen, H. (2014). Connecting Web Spammers with Advertisers. [online] http://www2007.org/. Available at: http://www2007.org/papers/paper111.pdf [Accessed 24 Oct. 2014].
  24. Howe, A. and Dreilinger, D. (1997). SAVVYSEARCH: A Metasearch Engine That Learns Which Search Engines to Query. AI Magazine, [online] 18(2), p.19. Available at: http://www.aaai.org/ojs/index.php/aimagazine/article/view/1290 [Accessed 26 Oct. 2014].
  25. Beigi, M., Benitez, A. and Chang, S. (2014). MetaSEEk: a content-based metasearch engine for images. http://proceedings.spiedigitallibrary.org/.