Friday, April 6, 2012

IMPURE@VGSoM: Tutorial for beginners


Impure is an online application which empowers people to be part of the information revolution. It is a powerful tool to gather, combine, analyze and deeply understand data on the Internet. You can work with your own data or with many sources available online, such as news feeds, social media streams, real-time or historical financial information, search results, images and many more.
Impure's modular interface lets you design information flows with ease, linking data sources to operators, controls and visualization methods within a graphical interface that clearly displays the structure of your process. In this way, it enables even non-programmers to work with information in a professional way and to explore complex bodies of data.
Among other possibilities, Impure allows you to:
§  easily read data from diverse sources and repositories
§  load your own data locally or remotely
§  visualize it in a wide range of ways (more than 100 visualization methods so far)
§  process it, compare it, mix it, filter it (more than 300 controls and operations so far)
§  publish and share your projects

Using Impure is easy and intuitive; you don't need to type any code. Everything is done by linking modules together to set up information flows that begin with feeds or other data inputs and end with processed data or visualizations. In between, you can set up interactive controls to let users choose or modify parameters dynamically and see results change in real time. Often the visualization modules themselves can be used as controls and feed the whole process back to enable exploration.
Impure has been conceived as a flexible tool that contributes to the democratization of the information age for all Internet citizens, and turns the Web into an unlimited resource for the generation of insights and knowledge in your preferred area of interest.

Basic Functionality


Everything in Impure is done by modules. Different kinds of modules are able to load, generate, analyze, convert, filter and visualize data. Others are controls that let users interact with the process. Many modules perform several of these operations at once.
Let's take a closer look at the components of a module:

Inlets and outlets

All modules have one or more inlets: these are the ports through which they get data from the outside world to do something with it. Some inlets are required, meaning the module will do nothing and sit idle until you connect these ports to some source. Others are optional, and give you further control over the operation(s) the module performs, but they have default values that will be used if you don't set one yourself.
Many modules (but not all of them) also have an outlet, which makes the result of the module's process available, so you can use it to feed other modules and thus set up your processing chain. One outlet can feed as many inlets as you wish, enabling you to process or visualize the same data in many different ways.
Both inlets and outlets offer convenient contextual help that shows up when you hover your mouse over them. There, you will find the corresponding data structure (in red), a short description (in black), and a string representation of the data that is coming in or out of that port, if any.
If you are used to a text-based programming environment, you can think of the inlets as the arguments of a function or method, and of the outlet as its return value.
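
To make the analogy concrete, here is a minimal Python sketch. The function scale_numbers is a hypothetical stand-in for a module, not part of Impure: its required argument plays the role of a required inlet, its defaulted argument plays the role of an optional inlet, and its return value plays the role of the outlet.

```python
# Hypothetical stand-in for an Impure module; this is not Impure's API.
def scale_numbers(numbers, factor=2.0):
    """numbers acts like a required inlet; factor acts like an optional inlet
    whose default value is used when nothing is connected to it."""
    # The return value plays the role of the module's outlet.
    return [n * factor for n in numbers]


# One outlet can feed several inlets: the same result is reused twice here.
scaled = scale_numbers([1, 2, 3])
total = sum(scaled)
largest = max(scaled)
```

Calling scale_numbers([1, 2, 3], factor=10) would correspond to setting the optional inlet yourself instead of relying on its default.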

Direct input

Some types of data structures give you the possibility of typing the values you wish directly into the inlet itself, saving you the need to place other modules on the stage for that purpose. These inlets are identified by a small arrow in the bottom right of the icon. There is also a gray rectangle to the left of the inlet: just click on it and start typing the input value.
After you click in any other part of the space, or do nothing for a few seconds, your input will automatically be converted into a Data Structure module of the appropriate type.


Modules can be very powerful, but they will do nothing by themselves. Impure comes to life only when you link them together to define an information flow that begins with some data source and typically ends with one or more visualizations or processed outputs.
Defining connections is easy. You just need to hover over an outlet, click on the purple circle that appears, drag the connecting line to your destination module, and release it on the appropriate inlet.

After the connection is established, it will be visible as a line with an arrow tip in the middle that shows the flow direction. If data is flowing through that channel, the line will be red. Otherwise (if the source has a void or null value) it will be yellow.

Module Categories:

Modules in Impure can do many different things. They are organized into five broad categories according to the function they perform.
Knowing which type of module you need in a particular situation is the first step toward finding your way around Impure's library and building spaces quickly and easily.
In this section you will find an explanation for each of these types and what they are useful for.

Data Structures

Data Structures are identified by the color RED 
Data Structures hold a piece of information of a given type. They have no inlets and only one outlet, which is the source from which you can read the data contained inside that "box".
There are many different kinds of Data Structures, but only some of the most basic ones can be placed as modules in an Impure space. You will be able to recognize them in the Library by the small arrow in the bottom right corner of their icons.
Those are the modules you can drag into the space and type information into directly.

Draggable (aka typable) Data Structures are the ones that can be defined by typing or pasting text, such as a Number or a NumberList. Many Data Structure modules cannot be placed on stage; that's because it is not possible (so far) to define their content only by typing. In the future more Data Structures will be typable (once we define a text code for them).
Untypable Data Structures exist only as inlets and outlets. Why are they in the Library? Because Data Structures are the basic pieces, and it is very important to always have access to the entire list and its documentation.
If you want to place a Data Structure that's not typable on the space, there are ways to do so. For instance, StringList is not typable, but you can place a String, write a text using a separator character, and then use splitString to build the desired StringList.
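
For readers who know a text-based language, the String plus splitString trick corresponds roughly to the Python sketch below. The helper split_string and the comma separator are assumptions made for illustration; Impure's splitString may take its parameters differently.

```python
# Hypothetical equivalent of placing a String and feeding it to splitString.
def split_string(text, separator=","):
    """Cut one string into a list of strings at every separator character."""
    return text.split(separator)


# The typed String uses "," as its separator; the result acts as a StringList.
string_list = split_string("apples,oranges,pears")
```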


Operators are identified by the color CYAN
Operators always have at least one inlet. They perform some kind of operation on the data that is fed to them, and return the result in the outlet.  
The Operators library is the most populated one (there are more than 300 operators so far).


Visualizators are identified by the color MAGENTA
Visualizators build some kind of visual representation from the data they receive. There are many options available for any kind of data you can manipulate within Impure. Different visualizations can reveal different aspects of a certain data set. Many visualizators also allow interaction, giving users the possibility to dynamically explore the data. Some also have an outlet that makes the result of that interaction available to feed other processes in the space.



Controls are identified by the color ORANGE
Controls let users interact with the space or perform some complex task (such as downloading data).


APIs are identified by the color GREEN.
APIs allow communication with many sources of information on the Internet. Some of the most frequently used ones are: Google search, Twitter search, Twitter word historical behavior, Market data, Flickr search, Flickr sets loader, Delicious account data load, Ebay items information, Dictionary definitions, Semantic expansion, etc.


The tabs on the left border are the access point for Impure's library, which is a set of lists containing all the modules that are currently part of the application. There are many of them! Finding exactly what you need might not be easy when you are just getting started, but there are several resources to help you with that task, and you will soon get used to finding what you are looking for quickly and effectively.
There's one broad classification to start with: according to the type of process they perform, modules belong to one of five module types. Each of them is identified by a distinctive color, which you can see on the left border tabs. Click on any of them to unfold the corresponding list. For example, these are the Operators:




How To


Quickly bring data to Impure

There are three main ways to bring data to an Impure space:

Using an API module

Using an API is a quick and easy way to obtain rich structured information from the Internet. For instance, you can load links from a Delicious account, all the images (with their tags) from a Flickr set, the historical market behavior of a company, the occurrences of some word on Twitter during the last month, etc.
There are many API modules in Impure and we plan to keep adding new ones all the time.

Using generator modules

Some operators, which you can find under the tag "generator", build complex data structures from simple parameters. For example, you can choose the number of nodes and relations, and instantly get a random network, or a NumberList filled with a given number. Generators are useful to perform quick tests.
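
As a rough illustration of what a generator does, the Python sketch below builds the two structures mentioned above from simple parameters. The function names, the seed parameter and the representation of a network as a list of nodes plus a list of node pairs are assumptions for this example, not Impure's own definitions.

```python
import random


def filled_number_list(length, value):
    """Build a list of the given length in which every entry is the same number."""
    return [value] * length


def random_network(n_nodes, n_relations, seed=None):
    """Pick n_relations distinct undirected links at random among n_nodes nodes."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    # Every possible pair of distinct nodes, each pair listed once.
    all_pairs = [(a, b) for a in nodes for b in nodes if a < b]
    relations = rng.sample(all_pairs, n_relations)
    return nodes, relations


numbers = filled_number_list(5, 3.5)
nodes, relations = random_network(6, 8, seed=1)
```

Passing a seed makes the random network reproducible, which is handy when a quick test needs to be repeated.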

Loading data from files

Perhaps you have tables in Excel files or CSV format, or some text you want to analyze. You can load the file into Impure using the FileLoader module, and then decode it using the appropriate method. For example, if you loaded a text file in .csv format, you can pass the output of FileLoader to the decoder csvToTable.
For the specific (and very common) case of .csv, you could also use csvLoader, which does everything in a single step. CSV is a text format that encodes tables; database and spreadsheet software, such as Excel, can usually export .csv with ease.
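
The difference between the two-step route and the single-step route can be sketched in Python as follows. The three functions are hypothetical analogues named after the Impure modules; they only illustrate the idea of loading raw text first and decoding it afterwards, and say nothing about how Impure implements it.

```python
import csv
import io


def file_loader(path):
    """Step 1: read the raw text of a file, without interpreting it."""
    with open(path, newline="", encoding="utf-8") as handle:
        return handle.read()


def csv_to_table(text):
    """Step 2: decode CSV text into a table, here a list of rows of strings."""
    return list(csv.reader(io.StringIO(text)))


def csv_loader(path):
    """Single-step shortcut: load and decode in one call."""
    return csv_to_table(file_loader(path))
```

Keeping the two steps separate is useful when the same raw text has to go through a different decoder; the shortcut is simply their composition.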

Quickly visualize data

If you have already placed data into an Impure space, chances are that you can visualize it immediately. Usually you just need to bring the appropriate visualizator to the space and connect it to your data.
The choice of visualizator naturally depends on the type of data you are working with. Some visualizators require more than one data structure.
Let's take a look at some examples of typical choices of visualizators for different kinds of data structures.

NumberList → Histogram


Two NumberLists → ColorScatter


NumberTable → SimpleNumberTableVisualizator

Network → Oracle

Use internet search to obtain valuable data

Let's take a specific look at a powerful technique for getting information from the Internet, which we call multi-search.
We are all used to conducting searches in search engines, such as Google. They return a list of web pages with occurrences of the string you typed. But sometimes you want to obtain more information than is present on any single page, or you are not interested in a specific word or idea, but in the relations among a set of concepts.
Imagine you want to find out about the similarities and differences between Internet browsers: not a specific pair, but all of them. We would usually search for a web page posted by someone who has already invested the time and effort necessary to do the comparison, and we would need to be lucky. It is much more likely that many people have done comparisons for specific pairs. If we could ask Google to return all the pages in which any two browsers are compared, we would obtain much more valuable information. But how can we do that? It would be great if we could type something like this:

“[browser_name] compared with [browser_name]”
“[browser_name] is faster than [browser_name]”
Quotation marks are important because they guarantee the search will be strict, meaning Google will only return pages in which the complete sentences are found. Once your StringList is ready, connect it to InternetMultiNSearchResults and watch as the module performs the searches one after another, populating a NumberList with the number of results for each of them.
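
The StringList for a multi-search is just every ordered pair of names dropped into the sentence template and wrapped in quotation marks. The Python sketch below shows one way to build it; the function build_queries and the three browser names are illustrative choices, not part of Impure.

```python
from itertools import permutations


def build_queries(names, template):
    """Fill the template with every ordered pair of distinct names,
    wrapping each sentence in quotation marks for a strict search."""
    return ['"' + template.format(first, second) + '"'
            for first, second in permutations(names, 2)]


browsers = ["Firefox", "Chrome", "Opera"]  # illustrative names only
queries = build_queries(browsers, "{} is faster than {}")
```

Ordered pairs matter for a sentence such as "is faster than", so three names give six queries. For a symmetric sentence such as "compared with", itertools.combinations would halve the number of searches.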


Loading data from Excel

Open an Excel file. It should be a simple spreadsheet: a table with rows and columns. It may have a header row at the top. Export this table to a CSV file through Excel's "Save as" function.
Now, create a new Impure space and place the control csvLoader in it.
Click on the input box of the first inlet and type the system path to the file you just created. You can also enter the URL of a file that has been uploaded to the Internet, as we are doing in this example. If your data does NOT have column titles in the first row, you can set the second inlet in the control to 'false'. By default, csvLoader will treat the first row as headers.
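
The effect of that second inlet can be pictured with a small Python sketch. The function table_from_csv, its parameter name and the sample data are made up for illustration; only the behavior (first row taken as headers unless told otherwise) follows the description above.

```python
import csv
import io


def table_from_csv(text, first_row_is_header=True):
    """Decode CSV text and optionally split off the first row as column titles."""
    rows = list(csv.reader(io.StringIO(text)))
    if first_row_is_header:
        return rows[0], rows[1:]
    # No titles in the file: every row is data.
    return None, rows


# Made-up sample data, with and without a header row.
headers, data = table_from_csv("name,score\nAnn,3\nBob,5\n")
no_headers, all_rows = table_from_csv("Ann,3\nBob,5\n", first_row_is_header=False)
```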

As soon as you have entered a valid path or url, csvLoader will start loading it and giving you feedback on the progress, in case it is a large file and takes a while to load. It will turn green when it is finished.

Voila! Your Excel data is now available within the Impure space to do whatever you want with it. Pay attention to the List and Table operators: you will find many ways of filtering, sorting, analyzing and combining tabular data.
Before processing the data in any way, though, you will probably want to see it and check that everything is all right. Just plug in the TableVisualizator, and you will have a convenient representation with scrollbars for panning around large tables. This module also lets you click on a cell to select it, making its contents available in the outlet.

Draw maps


Drawing a Simple Map Around the Empire State Building

First let's see how we can draw a simple map of a location using the Google API in Impure. For this we will use the GeocoderGoogleMaps API and the GoogleMapVisor.
First we need to find the geo coordinates of the center of our map. For example, let's say that we want to draw a map around the Empire State Building in New York. We pass this address as a String to the GeocoderGoogleMaps. In order to check the output we will use a TableVisualizator (Note: we set the optional parameter "column's width" of the TableVisualizator to 300 to see the entire string of the columns).
Now we need to use the GoogleMapVisor to draw the map. As input, we need to pass the first cell of the Table returned by GeocoderGoogleMaps; we do this using a getElementFromTable (with input 0,0). We use a zoom level of 16 and a map type of 1, which draws the buildings in 3D.


Drawing a Marker on the Map

Now let's draw a marker on this map. We do this by superimposing, on the map that we have, a drawing of a polygon at the desired coordinate. We do this using a Polygon2DSimpleVisualizator. Let's discuss each of the input parameters coming to this module:
1. The first input to this visualizator is a polygon for the marker. Since our marker is a single point, we use a listAssembler to construct a polygon with the single point coming from the GeocoderGoogleMaps module.
2. The bounding rectangle for this visualizator should be the same as the rectangle output by our existing GoogleMapVisor.
3. The radius of the marker is given by a NumberList (one for each point in the polygon; in our case there is a single point but we still need to create a list).
This gives us the following image:
If you now move the new visualizator exactly over the map, and then go to the view (eye) menu (top left of Impure) and tick off "visualizator panels", you will be able to see the map, with the marker superimposed!
Notice that if you move the map (dragging with the mouse) left and right, the marker moves correctly. However, if you move it up and down, the marker gets misaligned. This is because the GoogleMapVisor uses Mercator projected coordinates, whereas the Polygon2DSimpleVisualizator uses standard geometric coordinates. We will fix this in the next section.


Geo coordinates in Google are returned using the Mercator coordinate system, which differs from the normal Impure coordinate system for several reasons: they cover the surface of a sphere (the Earth) rather than a plane, and zero designates the equator. To represent a Mercator coordinate properly in Impure, we need to use a universalProjectionOnTransformationGeo module. Here is an example:
We will need to use this technique to transform both the geo coordinate of the place being marked and the coordinates of the Rectangle given by the GoogleMapVisor before passing them to the Polygon2DSimpleVisualizator. If we do this we will finally obtain a properly marked map which can be moved:
(We cheated a little bit in the image above: we painted the marker red instead of the default grey. You can do this by providing a color to the 'color' parameter of the Polygon2DSimpleVisualizator. To learn more about colors you can see color)
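
To see why the untransformed marker drifted only when the map moved up and down, it helps to look at the standard spherical Mercator formula. The Python sketch below implements that textbook formula on a unit sphere; it is an assumption that universalProjectionOnTransformationGeo performs an equivalent computation, and the function name mercator is ours.

```python
import math


def mercator(lon_deg, lat_deg):
    """Textbook spherical Mercator projection on a unit sphere.
    x grows linearly with longitude; y grows faster and faster with latitude."""
    x = math.radians(lon_deg)
    y = math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Equal 10-degree steps in longitude always cover the same horizontal distance.
lon_step_low = mercator(20, 0)[0] - mercator(10, 0)[0]
lon_step_high = mercator(60, 0)[0] - mercator(50, 0)[0]

# Equal 10-degree steps in latitude cover more vertical distance far from the equator.
lat_step_low = mercator(0, 20)[1] - mercator(0, 10)[1]
lat_step_high = mercator(0, 60)[1] - mercator(0, 50)[1]
```

Because x is linear in longitude, a marker placed with plain coordinates stays aligned horizontally. Because y is not linear in latitude, it only stays aligned vertically once both the marker and the bounding rectangle have been projected.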

Multiple Markers

Once we have this schema for drawing a single marker, it is extremely easy to extend it to multiple markers. All it takes is to change the GeocoderGoogleMaps to a GeocoderMultiGoogleMaps and to provide multiple addresses! Here is the final schema (we highlighted in blue the changes from the last one, to emphasize how little we changed!).

Link for Data used for making examples:
Presentation for Impure:
Impure data analytics & visualization tool (PowerPoint from Shubham Gupta)

Link for presentation: Impure Presentation
