
 

Text Analysis and Pattern Detection

The Blue Dots Project
NSF Highlights
December 2009

In The Blue Dots Project, Professor Lewis Lancaster of ECAI, University of California, Berkeley, explores the design of high-dimensional visualizations that analyze text structure and patterns for humanities scholars.  Multi-dimensional interactive visualizations are a normal component of scholarly processes in the sciences and engineering, where such tools are highly specialized to support the research processes of specific disciplines. This project adds a complementary capability by developing a visual, ergonomic methodology for interactive search, retrieval, browsing, and analysis within large text corpora.  Although the methodology is tailored to the specific needs of humanities researchers, it is not limited to the features of individual languages.  It supports corpus browsing, searching, pattern identification, interactive interrogation, and seamless linkage to ‘witnesses’ – digitized texts and images of original manuscripts.

The corpus used for this demonstration project includes 160,465 pages of rubbings from printing blocks carved in the 13th century in Korea. Search capabilities include faceted searching of any combination of the following:

  • Key words – up to ten words and/or phrase entries. Provides “proximate” search capability for words near one another but not necessarily contiguous. The range of distance between words can be set from 1 to 62.
  • People – names of translators, authors, and compilers.
  • Place – names of monasteries where translations and/or compilations were made. Since translators often moved from place to place, multiple place names may be searched simultaneously. Results may be imported into Google Earth for geographic display.
  • Time – A time slider bar can be set to limit search results.
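
As an illustration of the “proximate” key word search described above, the following is a minimal sketch in Python, not the project's actual implementation. The function name `proximate_hits`, the representation of a text as a list of glyphs, and the definition of distance as the difference between the first and last matched positions are all assumptions made for this example; only the 1 to 62 range comes from the description above.

```python
from itertools import product


def proximate_hits(glyphs, terms, max_distance):
    """Find groups of positions, one per term, that lie near one another.

    glyphs       -- the text as a list of single-glyph strings (assumed format)
    terms        -- the glyphs to look for (hypothetical single-glyph terms)
    max_distance -- largest allowed gap between the first and last match,
                    limited to the 1..62 range given in the description
    """
    if not 1 <= max_distance <= 62:
        raise ValueError("max_distance must be between 1 and 62")

    # Collect every position at which each term occurs.
    positions = [
        [i for i, glyph in enumerate(glyphs) if glyph == term]
        for term in terms
    ]

    # Keep each combination whose overall span fits within max_distance.
    hits = []
    for combo in product(*positions):
        if max(combo) - min(combo) <= max_distance:
            hits.append(combo)
    return hits


if __name__ == "__main__":
    sample = list("abcxdefy")
    print(proximate_hits(sample, ["x", "y"], 4))  # positions 3 and 7 are 4 apart
    print(proximate_hits(sample, ["x", "y"], 3))  # too far apart at this setting
```

The brute-force comparison of all position combinations keeps the sketch short; a production system working over a full canon would more plausibly rely on a positional index.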

Results are displayed in a 3-D environment in which users can view and navigate the image. The image uses both the abstracted form of a “dot” and color to convey the information being retrieved. Each “Blue Dot” represents one glyph of the dataset. Alternate colors indicate the positions of search results. Using color, form, and dimension to make information quickly understandable is essential for large data sets, where a target word or phrase may occur thousands of times. Multiple tools for manipulating the 3-D space allow a variety of interrogation methods. A navigable 2-D grid, in which each text is represented, shows where the terms appear among the 1504 individual texts that make up the canon. Individual result elements are linked to catalogue entries, the text lines of each occurrence, digitized full-text pages, and scanned images of the rubbings.
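
The 2-D grid can be pictured with a short Python sketch. This is an illustration under stated assumptions rather than the project's code: the function name `hit_grid`, the row-by-row layout, and the grid width of 47 columns are hypothetical (47 was chosen only because it divides 1504 evenly into 32 rows); the count of 1504 texts is the only figure taken from the description above.

```python
def hit_grid(hits_per_text, total_texts=1504, columns=47):
    """Lay out per-text hit counts on a 2-D grid, one cell per text.

    hits_per_text -- dict mapping a 0-based text index to its number of hits
    total_texts   -- number of texts in the canon (1504 in the description)
    columns       -- grid width; 47 is an assumption made for this sketch
    """
    rows = -(-total_texts // columns)  # ceiling division
    grid = [[0] * columns for _ in range(rows)]

    for text_index, count in hits_per_text.items():
        if not 0 <= text_index < total_texts:
            raise IndexError(f"text index {text_index} is out of range")
        # Texts fill the grid row by row (assumed ordering).
        row, col = divmod(text_index, columns)
        grid[row][col] = count
    return grid


if __name__ == "__main__":
    grid = hit_grid({0: 3, 47: 1, 1503: 2})
    print(len(grid), len(grid[0]))
    print(grid[0][0], grid[1][0], grid[31][46])
```

A cell with a non-zero count would then be drawn in a contrasting color, giving the user a quick indication of which texts to open in the 3-D view.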

Potentially Transformative Research
The tools developed by The Blue Dots Project can transform research on large text corpora. The entire scholarly process is incorporated into this new research environment, including initial hypothesis, empirical testing, data collection, results capture, and storage of results for disseminating conclusions.  Researchers previously had no way to handle text data on the order of 100 MB, together with scanned images of the originals amounting to 25 GB at low resolution, in such a fast and intuitive manner. This environment can be readily expanded to become a multi-lingual toolset for any collection of text data.

Intellectual Merit
The identification of patterns and outliers in the occurrence and location of words within a text corpus is often a crucial factor in deciding the relative importance of a given text to a research question.  Using the abstracted 3-D model developed in The Blue Dots Project to search for word strings, cluster terms, and automatically analyze Ring Construction, and then viewing the results by time, creator, and place, will provide textual scholarship with a new approach and new insights into the structure, history, and evolution of texts.

Broader Impacts
The visualization design developed for this project is intended to be intuitive for both researchers and learners. It includes capture of search results to encourage scholarly collaboration.  The Blue Dots Project interface will be adopted by the Atlas of Chinese Religions, which has geo-registered more than 30,000 monastery sites, including those related to the project corpus. The possibilities of using this system within an immersive virtual reality environment are being explored in detail with The New Media Art School at City University of Hong Kong.

Accompanying Images

BlueDot1

1) 3-D Text Corpus View.  Each character in the text is represented by a single ‘Blue Dot’. Planes of ‘Blue Dots’ represent individual printing blocks. Search results are highlighted in contrasting colors. Multiple navigation methods allow users to interrogate the space.

BlueDot2

2) Place Name Search Result. In the 3-D Display, areas of ‘Red Dots’ indicate documents translated or compiled at this location. The Geo-Display shows the location of each monastery found in the data results, mapped in Google Earth.

BlueDot3

3) 2-D Text Corpus View. In the 2-D matrix, each pixel represents one of the 1504 text instances. Colors indicate search result locations in the complete corpus. This feature enables users to quickly navigate to sections of interest in the 3-D environment.

BlueDot4

4) Links to Original Texts.  Each occurrence of a key word search result is displayed as a ‘Red Dot’ in the 3-D corpus view.  Each of these ‘Red Dots’ is linked to the original line of text in the digitized canon and to a scanned image of the printing block rubbing. Highlighted links in the text allow browsing to a reference dictionary.
