Comparing Network and Visual Tools

My last three posts each highlighted a particular tool for visualizing data. Each tool has its own strengths and weaknesses, so choosing which one to use depends on what the user wants to show.

Voyant is useful in conjunction with text mining. It works from the same dataset used for text mining and shows key trends within it, including the most common words and phrases. The visuals Voyant provides are line graphs that track these words and phrases across the dataset.

CartoDB uses geographic information within datasets to display the information on a map. This allows users to see the geographic relationships within the topics of their dataset. The various types of maps it can produce highlight different aspects of the information.

Palladio, while it does have a map feature, is strongest at visualizing how pieces of information in a dataset relate to one another. Its main output takes the form of network graphs, which can complement the line graphs that Voyant provides.

Each tool allows users to gain a particular insight into the information. By seeing the information displayed visually, rather than textually, researchers are able to see various relationships that may not come across through reading the material. Seeing which words are used most often can provide the common language of the time, while seeing a map of where events took place can show some of the biases within the information.

Choosing which tool to use boils down to what the dataset includes. If there is no geographic information, CartoDB is of little to no use. If the dataset spans a long period of time, Voyant can track word usage and vocabulary trends across it. And if there is a wide variety of topics that can be compared with each other, Palladio may be the best option.

There is no rule saying only one tool can be used. If the dataset has enough information, all three can be used effectively to highlight different aspects of it.

Working with Voyant

Voyant is a text analysis tool that allows the user to bring in a large number of text documents and manipulate the content in order to gain a better understanding of the material. Voyant creates two things: a word map and graphs. A word map is a cluster of words in which the more common words are sized larger than the less common ones. Graphs allow the user to plot the usage of specific words throughout the sources entered into the site. Both of these methods give the reader insight into word usage within the texts. Word maps and graphs can be exported at any time, and users are given the option to save them as a URL, HTML page, or static image.

After the user uploads text documents, Voyant will immediately populate a word map in the top left. It includes every word in the documents, so one key step is needed in order to gain the most knowledge from the map: eliminating common words such as ‘and’ and ‘the’. This is done with the ‘Stop Words’ tool in the summary window. Clicking on the gear will bring up the tool. Users can choose from pre-built lists and can add their own words if needed. Once this is completed, the word map should update without those words.
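The effect of a stop word list on a word map can be sketched in a few lines of Python. This is only an illustration of the idea, not Voyant's own code: the `STOP_WORDS` set, the `word_counts` function, and the sample sentence are all made up for this example, and Voyant's pre-built lists are far longer.

```python
import re
from collections import Counter

# A tiny, illustrative stop list (an assumption for this sketch).
STOP_WORDS = {"and", "the", "a", "of", "to", "in", "is", "it"}

def word_counts(text, stop_words=STOP_WORDS):
    """Count the words in text, ignoring case and skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Hypothetical sample text standing in for an uploaded document.
sample = "The ship and the crew left the harbor, and the ship returned to the harbor."
counts = word_counts(sample)

# The highest counts are the words a word map would draw largest.
for word, count in counts.most_common(3):
    print(word, count)
```

Without the stop list, ‘the’ would have the highest count in this sample and would crowd out the words that actually describe the text.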

In order to create a graph, users can click on words either in the word map or in the summary box directly below it. The window titled ‘Words in Entire Corpus’ allows users to check multiple words, which will then appear in the graph in the top right of the screen. If more than one text document has been uploaded, it is possible to graph words that appear in more than one document. Clicking through the various documents in the ‘Keywords in Context’ window in the bottom right allows the user to choose which documents Voyant will search. This is helpful for comparing how these specific words are used from one document to the next. When the documents cover different periods, the graphs show the usage of specific words over time, allowing for analysis of changing trends within the selected documents.
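The kind of comparison these graphs make can be approximated with a short Python sketch. It assumes each document carries a label (here, a year) and measures a word's frequency per 1,000 words so that documents of different lengths can be compared. The function names and the three sample documents are hypothetical, and Voyant's own calculation may differ.

```python
import re

def relative_frequency(text, word):
    """Occurrences of word per 1,000 words of text (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1000 * words.count(word.lower()) / len(words)

def word_trend(documents, word):
    """Return (label, relative frequency) pairs in document order."""
    return [(label, relative_frequency(text, word)) for label, text in documents]

# Hypothetical documents labeled by year, so the trend reads as change over time.
documents = [
    ("1850", "The railway opened and the railway brought trade."),
    ("1860", "Trade grew as the railway expanded."),
    ("1870", "Steamships carried trade across the ocean."),
]

trend = word_trend(documents, "railway")
for label, freq in trend:
    print(label, round(freq, 1))
```

Plotting the pairs in `trend` as a line would give a rough equivalent of a single-word graph, with one point per document.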