By the Book

Month

October 2015

Article Summary for Lecture #10 – Hariri

“Relevance Ranking on Google: Are Top Ranked Results Really Considered More Relevant By the Users?”

Nadjla Hariri’s article examines the differences between relevance as determined by users and relevance as determined by search algorithms. He begins the discussion with a brief description of Google’s ranking methods. To sort by relevance, Google uses a patented algorithm called PageRank, which considers more than 500 million variables and 200 billion terms in sorting pages. It also factors in ‘votes,’ which are cast for a page when another page links to it, making the page more likely to be considered relevant. Through his research, Hariri wanted to find out whether the relevance ranking produced by this complex algorithm matched what users really found relevant to their lives. Additionally, noting that users rarely go past the first or second page in skimming results, he wanted to know whether the results on later pages are actually less relevant than those displayed at the beginning. The article proceeds to define ‘relevance’ as it is used in the research: “the closeness of retrieved documents to queries on the basis of the subject matter or other specified attributes of the record” (599), and to demarcate which records would be found on which page. Next, Hariri launches into his literature review. Some of the studies cited compared various search engines to each other, producing a variety of contradictory conclusions as to which engine produced the most relevant results. Another batch of studies tested whether search engine rankings matched up with user choices and/or expectations (this occurred very rarely). Notably, most of these studies relied on human judgement to assess relevance. The article then explains the methodology used: thirty-four different students from various backgrounds used Google for thirty-four different searches. The students were then given the first four pages of their hits and told to mark each hit as most relevant, relevant, or irrelevant. Hariri took this data and calculated precision ratios from it. 
The results showed that the document users judged most relevant was often ranked fifth by Google, with Google’s top-ranked result placing second. Almost all students surveyed found either the first or the fifth item to be the most relevant. Generally, precision was highest on page one, followed by pages two, four, and then three. He does note that even on the fourth page there were at least three hits judged relevant by 40% of the users, some of which were just as good in quality as those on earlier pages. Hariri ends with his conclusions. This study supports previous studies which showed that even the best search engines aren’t as good as they might first appear to be. Because of this, searchers who weren’t able to find good material on the first couple of pages should keep looking through the later pages of their search results, as there may be relevant results a little further back.
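To make the idea of a per-page precision ratio concrete, here is a minimal sketch of how it could be computed. It is not Hariri’s actual procedure: it assumes precision means relevant hits divided by hits retrieved, that both “most relevant” and “relevant” count as relevant, and that each results page holds ten hits. The function names and the sample judgments are invented purely for illustration.

```python
from typing import Dict, List

# Assumption: both of these labels count as "relevant" when computing precision.
RELEVANT_LABELS = {"most relevant", "relevant"}


def precision(judgments: List[str]) -> float:
    """Fraction of the judged hits that were marked relevant."""
    if not judgments:
        return 0.0
    relevant = sum(1 for label in judgments if label in RELEVANT_LABELS)
    return relevant / len(judgments)


def precision_by_page(judgments: List[str], page_size: int = 10) -> Dict[int, float]:
    """Split a ranked list of judgments into pages and score each page."""
    pages = {}
    for start in range(0, len(judgments), page_size):
        page_number = start // page_size + 1
        pages[page_number] = precision(judgments[start:start + page_size])
    return pages


def make_page(n_relevant: int, size: int = 10) -> List[str]:
    """Build one hypothetical page with n_relevant relevant hits."""
    return ["relevant"] * n_relevant + ["irrelevant"] * (size - n_relevant)


# Invented judgments for one search: four pages of ten hits each.
sample = make_page(7) + make_page(5) + make_page(3) + make_page(4)

for page, score in precision_by_page(sample).items():
    print(f"page {page}: precision {score:.2f}")
```

In this made-up sample, page four scores higher than page three, the same kind of pattern the study reports, although the numbers here are not taken from the article.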

This article is easy to understand, and describes an experiment with a rational methodology and believable results. Overall, it seems like a useful and reliable source.

Article Summary for Lecture #9 – Mai

“Analysis in Indexing: Document and Domain Centered Approaches”

Mai’s paper presents the reader with the idea of a domain-centered approach to indexing, as opposed to the traditional document-centered focus. Indexing, as Mai uses the term, is the process of discovering the subjects of documents, then converting those subjects into index terms which will be retrieved by searches. There are two different foci in indexing: document-oriented and user-oriented. Document-oriented indexing pulls subject information only from the document, and does not take any context surrounding the document into account. A slightly modified version of this is the document-centered approach, which pulls from the document but does bear possible user queries in mind. Different from both of these, however, is the user-centered approach, which focuses primarily on what will make the document meet user needs. The domain-centered approach increases this focus, basing its analysis of documents on its understanding of users. Mai notes that the simplest types of indexing are done in two steps: first, analysis of the document to find out the subject matter, then a translation of subjects into index terms. This study is directed towards making the first step clearer. With this goal, the piece continues on to a discussion of standards versus guidelines, stating that indexing must be somewhere between the two, as there can be standardization in the expression of subjects, but only recommendations in the determination of those subjects. Indexing guidelines point indexers towards sources of subject information, but don’t say exactly how to get information from them. As Mai points out, even if a book title (or table of contents, blurb, etc.) is pretty precise as to the book’s contents, there are several places the book can be placed in the Dewey Decimal System, and it falls to indexers to decide where. This launches him into a discussion of changes in textual analysis theory, and the idea that textual meaning is formed by the reader and his or her context. 
He offers two definitions of context (the objectified and the interpretive) and suggests both play a part in the indexer’s determination of domain. This finally brings the article to concentrate on domain analysis, which holds that the organization and representation of data should start with the analysis of the data’s context, which is contained in the domain. He notes that disciplines and specialties are too broad to function as domains, which should rather be defined as “a group of people who share common goals” (606), such as a group of public library users. Once the indexer has examined the boundaries and needs of the domain, he or she is able to express the document’s subject in terms of the needs of the domain. Although the document-centered approach is more stable, Mai feels that the most useful subjects can be derived from using the domain-centered approach.

Mai’s points are interesting and have some strength to them, although I very much doubt that the domain-centered approach will ever be put into effect. As it can only effectively be done locally, and it seems likely the subjects would have to be redone occasionally to keep up with changing user needs, the expense would be enormous. Apart from that, the article is well written, if inclined to be somewhat repetitious.

Article Summary for Lecture #8 – Gross, Taylor, & Joudrey

“Still a Lot to Lose: The Role of Controlled Vocabulary in Keyword Searching”

This piece is written in answer to a commonly expressed sentiment in library circles: since lots of patrons use only keywords for searching, surely we can get rid of expensive and inflexible subject headings and rely on keyword searching only? By no means, reply Gross, Taylor, and Joudrey, and back up their answer with an exhaustive mountain of studies. Like many other studies, the 2005 precursor to this article found that search results were reduced by over a third when run without subject headings. Here, many more studies and articles are surveyed in depth. The articles examined were generally either very much for abolishing subject headings in favor of keyword searching, or of the opinion that controlled vocabulary is a necessary part of successful searches. This pro-controlled vocabulary body of literature found that keyword searching, although sufficient for quick queries, needs to be supplemented by controlled vocabularies in order to serve the needs of researchers. Many authors also noted that many hits from keyword searches are irrelevant. A number of studies found that subject headings were important for accessing non-textual materials, as well as sources in a variety of specialized fields of study. The section concludes with a list of the most common solutions offered by the literature reviewed: use both keyword searching and controlled vocabulary (many found these systems complementary); import user tagging to help with searches (although tags can be helpful, they bring a set of problems involving bias and irrelevance); take user search terms and use them to augment the controlled vocabularies; create tools to help inexperienced users utilize controlled vocabulary; and include metadata such as tables of contents and summaries that have more words to enrich keyword searching (although this tends to decrease the accuracy of hits). 
This study used the same research question as was the center of the previous work: What proportion of records retrieved by a keyword search would not be retrieved if there were no subject headings? However, this time it also looked at: What proportion of records retrieved by a keyword search has a keyword only in a subject heading field in a catalog enriched with TOCs and summary notes? and What proportion of records retrieved by a keyword search has a keyword only in a subject heading field when the results are not limited to English? The authors explain their methodology, which includes two searches for each topic in order to prune results, and the limitations of the study. There was no data about foreign-language results from the 2005 survey, making comparison impossible. Finally, the article discusses the findings of the search. The results revealed that 27.7% of hits would be lost without subject headings, higher in the foreign language fields. The TOC enhanced results were better, but still inferior to the subject heading results. The article then discusses possible future research, and concludes.
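The central research question can be illustrated with a small sketch. This is not the authors’ methodology: it assumes a toy catalog where each record is a set of named text fields, treats a simple case-insensitive substring match as a “keyword search,” and counts a record as lost when the keyword appears only in the subject heading field. The field names, the function names, and the sample records are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def matches(text: str, keyword: str) -> bool:
    """Case-insensitive substring match, a stand-in for real keyword search."""
    return keyword.lower() in text.lower()


def proportion_lost(records: List[Record], keyword: str,
                    subject_field: str = "subject_headings") -> float:
    """Share of keyword hits that would vanish if subject headings were removed."""
    # Records retrieved when every field, including subject headings, is searched.
    retrieved = [r for r in records
                 if any(matches(value, keyword) for value in r.values())]
    if not retrieved:
        return 0.0
    # A hit is lost when no field other than the subject headings has the keyword.
    lost = [r for r in retrieved
            if not any(matches(value, keyword)
                       for field, value in r.items() if field != subject_field)]
    return len(lost) / len(retrieved)


# Invented catalog records, for illustration only.
catalog = [
    {"title": "Falconry in Medieval Europe", "subject_headings": "Falconry -- History"},
    {"title": "The Art of Hawking", "subject_headings": "Falconry"},
    {"title": "Birds of Prey", "notes": "Includes a chapter on falconry.",
     "subject_headings": "Raptors"},
    {"title": "A Falconry Handbook", "subject_headings": "Falconry -- Handbooks"},
    {"title": "Garden Birds", "subject_headings": "Birds"},
]

print(f"{proportion_lost(catalog, 'falconry'):.1%} of hits rely on subject headings alone")
```

Here only “The Art of Hawking” depends entirely on its subject heading, so one of the four hits would be lost. Adding a notes or table of contents field to that record would rescue it, which is the effect the enriched-catalog question is probing.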

We read the often mentioned precursor to this article in one of my other classes. I found this study to be a thorough and satisfying follow-up from the previous study, as the authors support their points with abundant data and careful methodology.

Article Summary for Lecture #7 – Taylor

“On the Subject of Subjects” by Arlene G. Taylor

“On the Subject of Subjects” offers extensive insights into the field of subject cataloging as it stood in 1995. From the beginning, librarians have been inclined to ignore or look down upon subject cataloging and searching, despite the fact that catalog use statistics have shown subject searching to be one of the most common methods of information-seeking. Certainly the Internet is the site of many, many subject searches, although their efficacy is hindered by the lack of trained librarians sorting and organizing it. However, many librarians are beginning to create guides and organizational tools to help combat the chaos that is the Web. Taylor offers examples of this, both in professionally organized and freelance contexts. One such endeavor (tremendously optimistic, from our later view) involves cataloging useful websites with standard MARC records and URLs, adding the records into the OCLC. The article also touches on standards work being done (including such efforts as Core records, Dublin Core, and TEI Headers), which will determine the future of subjects. This leads to a discussion of controlled vocabulary, which some commentators believe has been made obsolete by keyword searching. Taylor disagrees with this, pointing out that although keywords may work for a casual searcher looking for any kind of result, they are often nearly useless for dedicated researchers with very specific needs. This also relates to the problem of specific entries: providing entries specific enough for researchers to find what they are looking for, even while the ways in which subjects are referred to continue to change. Also mentioned is the controversy regarding classification, and how it is to be adapted from a physical to a Web setting. After touching on these topics, Taylor continues on to a lengthier discussion of OPACs. 
OPACs’ subject searches often provide either too many or too few results, stemming from users’ unfamiliarity with subject headings, the problem being exacerbated by the ways in which results are displayed. The article discusses various research avenues being explored to correct these flaws, such as improving relational structures within catalogs and routing zero-hit searches through actions likely to produce some type of results. Finally, the article closes with a discussion of Library of Congress Subject Headings, noting a variety of recommendations which have been made to help make LCSH easier for users to navigate.

Taylor certainly is competent, and her article provides a thorough look at the forces that were shaping the future of subject searching and subject headings at the time of her writing. The topics touched on are also interesting to regard in retrospect, seeing how some of the issues have since been addressed, or remained in a somewhat altered form. But the article is terribly dated, as it is partially written about the rapidly changing world of the Internet. I think that a newer article about the current state of the subject heading would be much more beneficial to include as a reading.
