4. Social Network Analysis

In recent years, the analysis and modeling of networks, and also networked dynamical systems, have been the subject of considerable interdisciplinary interest in science and innovation studies. Innovation results from interaction within a network of organisations in the public and private sectors whose activities and interactions initiate, import, modify and diffuse new technologies; the term ‘innovation system’ is used to emphasize this. In this lesson, we turn to social network analysis as a tool to map the network properties of developments in science and innovation.

Core concept: proximity

Boschma (2005) argues that the importance of geographical proximity cannot be assessed in isolation, but should always be examined in relation to other dimensions of proximity that may provide alternative solutions to the problem of coordination.

  • Design a model of innovation based on the proximity thesis. Justify your choices.
  • Design indicators of innovative performance according to your model. Justify your choices.
  • Collect data and provide an interpretation. What do the values of the indicators mean for the outcomes of your model?

Additional literature: The new science of networks?

Hanneman and Riddle (2005) provide an online introduction to social network methods (published in digital form at http://faculty.ucr.edu/~hanneman/ ). Social network analysis views social relationships in terms of network theory, as structures consisting of nodes and ties (also called edges, links, or connections). Nodes are the individual actors within the networks, and ties are the relationships between the actors. The resulting graph-based structures are often very complex, and there can be many kinds of ties between the nodes. Research in a number of academic fields has shown that social networks operate on many levels, from families up to the level of nations, and play a critical role in determining the way problems are solved, organizations are run, and the degree to which individuals succeed in achieving their goals.
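As a minimal sketch of the node-and-tie idea described above, the following Python fragment stores a handful of ties and counts how many ties each node has. The actor names and ties are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical ties between four actors (illustrative names only).
ties = [("Ann", "Bob"), ("Bob", "Cem"), ("Ann", "Cem"), ("Cem", "Dee")]

# Build an undirected adjacency structure: node -> set of neighbours.
neighbours = defaultdict(set)
for a, b in ties:
    neighbours[a].add(b)
    neighbours[b].add(a)

# Degree = number of ties attached to a node.
degree = {node: len(adj) for node, adj in neighbours.items()}
print(degree)
```

In this toy network, Cem has the most ties, which is the simplest indication of how well a node is connected.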

Alternatively, the book ‘Networks, Crowds, and Markets: Reasoning About a Highly Connected World’ by David Easley and Jon Kleinberg provides an excellent overview of different scientific perspectives on understanding networks and behavior. Drawing on ideas from economics, sociology, computing and information science, and applied mathematics, it describes the emerging field of study that is growing at the interface of all these areas, addressing fundamental questions about how the social, economic, and technological worlds are connected.

In recent years, the analysis and modeling of networks, and also networked dynamical systems, have been the subject of considerable interdisciplinary interest, yielding several hundred papers in physics, mathematics, computer science, biology, economics, and sociology journals, as well as a number of books. Watts reviews the major findings of this emerging field and briefly discusses their relationship with previous work in the social and mathematical sciences (Watts 2004).

Please discuss (6 slides) the idea of social network analysis. Explain important concepts.

Core empirical analysis: Gephi

Gephi is open-source software for visualizing and analysing large network graphs. Gephi uses a 3D render engine to display graphs in real time and speed up the exploration. You can use it to explore, analyse, spatialise, filter, cluster, manipulate and export all types of graphs.

On the Gephi website you’ll find several tutorials. https://gephi.org/users/

Please carry out the Quick Start Guide, the Tutorial Visualization, and Tutorial Layouts. 

Gephi uses CSV files (comma-separated values) with its Data Laboratory. Gephi expects each row of the file to be a node or an edge. Note that the import can be done at any moment; the workspace does not need to be empty. You should know some general aspects of the import process:

  • Only the columns that you select will be used, except for the mandatory columns (source and target of edges), which cannot be deselected
  • If a column title already exists in the workspace you will be able to use it, but the data type of the column cannot be changed, and imported data will be parsed to fit the existing column type
  • Data for each row and column will be parsed to the given or existing column data type only when this is possible; if it is not possible for a cell, that cell will be set to a null value

You can choose which table to import the rows into (nodes or edges), but the behaviour and requirements differ somewhat, so check which options are selected before executing the import process.
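A minimal sketch of what such an edge file could look like, written with Python's standard csv module. The ties are hypothetical, and the column titles Source, Target and Weight are an assumption based on the mandatory source and target columns mentioned above; check them against the import wizard in your Gephi version.

```python
import csv
import io

# Hypothetical collaboration ties: (source, target, weight).
ties = [("Ann", "Bob", 3), ("Bob", "Cem", 1), ("Ann", "Cem", 2)]

def edges_csv(rows):
    """Return edge-table CSV text with Source and Target columns plus a weight."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])  # assumed column titles
    writer.writerows(rows)
    return buffer.getvalue()

text = edges_csv(ties)
print(text)
# To use it in Gephi, save the text to a file first, for example:
# with open("edges.csv", "w", newline="") as handle:
#     handle.write(text)
```

Each row after the header is one edge, which matches the one-row-per-edge layout that the import expects.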

We are also going to learn how to adapt the import process to our needs by choosing among the options that the import wizard offers. Use the ‘Import CSV’ button in Data Laboratory as seen in the picture.

Then a wizard will open and you will have to choose some generic, table independent options.

Gephi also offers several metrics that are often used in the social sciences to indicate how well a node is connected, as well as other network statistics. Can you make a nice visualisation of the following collaboration network (right click > save as) and provide some centrality measures and an interpretation? This .net file can be read into Gephi directly via right click > open with Gephi. Gephi calculates degree centrality, betweenness, closeness, density, and path length.
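To get a feeling for what some of these statistics measure, the sketch below computes density, degree centrality, closeness and average path length for a small hypothetical network using only the Python standard library. The network is invented, betweenness is left out for brevity, and the normalisations used here are common textbook choices that may differ from the values Gephi reports.

```python
from collections import deque

# Hypothetical undirected collaboration network (illustrative names only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

nodes = sorted({n for e in edges for n in e})
adj = {n: set() for n in nodes}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def distances_from(start):
    """Breadth-first search: shortest path length from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adj[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist

n = len(nodes)
# Density: observed ties divided by the n(n-1)/2 possible ties.
density = len(edges) / (n * (n - 1) / 2)
# Degree centrality: share of the other nodes a node is tied to.
degree_centrality = {v: len(adj[v]) / (n - 1) for v in nodes}
# Closeness: (n-1) divided by the summed distances to all other nodes
# (valid here because the example network is connected).
closeness = {v: (n - 1) / sum(distances_from(v).values()) for v in nodes}
# Average path length over all ordered pairs of distinct nodes.
avg_path_length = sum(sum(distances_from(v).values()) for v in nodes) / (n * (n - 1))

print(density, degree_centrality, closeness, avg_path_length)
```

In this example node C scores highest on both degree and closeness, because it sits between the triangle A, B, C and the chain D, E.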

Networks of words: Co-word analysis

In recent decades, as the scientific literature has grown dramatically, scientists have found it increasingly difficult to locate the data they need, and policymakers have found it increasingly difficult to understand the complex interrelationships of scientific content in order to design effective research policies. Some quantitative techniques have been developed to ameliorate these problems; co-word analysis is one of these techniques. Based on the co-occurrence of words, co-word analysis is used to discover linkages among subjects in a research field and thus to trace the development of science. Over the last decades, this technique, implemented by several research groups, has proved to be a powerful tool for knowledge discovery in databases.

We will now focus on relations among texts based on (co-)occurrences of words. Several authors (e.g., Callon et al., 1990; cf. Leydesdorff, 1997) have focused on so-called “co-words,” but this technique unnecessarily restricts the analysis to dyads. One may also encounter tri-words or higher-order co-occurrences. Additionally, the sharing of a single word may be a meaningful connection among texts. Dyads are thus a special case. Technically, we can construct a co-occurrence matrix (of dyads) from an occurrence matrix, but not vice versa because single occurrences, for example, have then been lost.
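The step from an occurrence matrix to a co-occurrence matrix can be sketched as follows: for every pair of words, count the texts in which both occur (the product of the transposed occurrence matrix with the matrix itself). The words and the small matrix are hypothetical.

```python
# Hypothetical occurrence matrix: rows are texts, columns are words (1 = word occurs).
words = ["algae", "biomass", "seaweed"]
occurrence = [
    [1, 1, 0],  # text 1
    [1, 1, 1],  # text 2
    [0, 0, 1],  # text 3
]

def cooccurrence(matrix):
    """Word-by-word co-occurrence counts: transposed matrix times the matrix."""
    n_words = len(matrix[0])
    return [
        [sum(row[i] * row[j] for row in matrix) for j in range(n_words)]
        for i in range(n_words)
    ]

co = cooccurrence(occurrence)
for word, row in zip(words, co):
    print(word, row)
```

The diagonal holds the number of texts in which each word occurs and the off-diagonal cells hold the dyads. The result no longer records which text contained which words, which is why the occurrence matrix cannot be reconstructed from it.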

 Semantic map for the titles of Gaston Heimeriks.

Making a word-occurrence network

First, we need to obtain a data set from the ISI Web of Science. Run a search that results in around 500 records, for example ts=(aquatic and biomass and (algae or seaweed)). Download these records and import them into a new database using the SAINT ISI data importer. Please note that at most 500 records can be downloaded at a time. You may wish to import queries that you constructed in another database.

An occurrence matrix is a matrix of cases (units of analysis) and variables, as we know it from SPSS. The cases in our analysis are documents, and the variables are attributes of these documents such as words, citations, addresses, journal names, author names, etc. You created such a matrix earlier for the case of author names, but we haven’t yet studied this matrix in more detail.

Author names are already provided in the database that was downloaded from the Web of Science, but in the case of words we first have to compose a list. Scientometric data offer several opportunities for constructing word occurrences: authors provide a short list of keywords, but the titles and abstracts can also be used for this purpose. Let’s first construct a frequency list of word occurrences from the abstract words using the following query. To generate a ‘couple-words-articles’ table from the abstracts, you need to use the SAINT word-splitter.

This tool allows you to select the table that contains the field you wish to analyse; in our case the abstracts. New tables are automatically generated in your database. The new table containing the word-id’s and article-id’s can be used to generate a frequency list.

You’ll see that we have a problem: the most frequently occurring words are so-called ‘stopwords’ (the, of, and, in, by, etc.) that we don’t consider meaningful. For example, search engines routinely ignore these words in their search queries. Fortunately, stopword lists can easily be found on the internet. Copy a stopword list into Excel, and import it into your Access database.

With this imported stopword table, you can use Access’s easy-to-use query wizard to ‘subtract’ one table from the other. Under ‘create’, ‘query wizard’ select the option ‘find unmatched query wizard’. This query allows you to create a new table (words without stopwords) with all stopwords removed.
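For readers who want to see the logic of this ‘subtraction’ outside Access, here is a minimal Python sketch that splits texts into words, drops the stopwords and counts the rest. The two abstracts and the very short stopword list are hypothetical; this is an illustration of the idea, not a replacement for the SAINT word-splitter.

```python
from collections import Counter
import re

# Hypothetical abstracts and a tiny stopword list (real lists are much longer).
abstracts = [
    "The growth of algae in the aquatic biomass",
    "Seaweed and algae as a source of biomass",
]
stopwords = {"the", "of", "and", "in", "by", "as", "a"}

def word_frequencies(texts, stop):
    """Count lower-cased words across texts, skipping any word in the stopword set."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stop:
                counts[word] += 1
    return counts

freq = word_frequencies(abstracts, stopwords)
print(freq.most_common(3))
```

The check `word not in stop` plays the role of the unmatched query: only words without a match in the stopword table are kept.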

Please generate a new abstract word frequency list without the stopwords.

For our analysis we need a list of words as variables and the titles are the cases. Our output matrix should look like this and contain zeros and ones:

Word1 Word2 …. …. …. …. …. Word n
Text 1
Text 2
Text n

In order to construct this matrix we need to take the most frequently occurring words (minus stopwords). In the word-frequency query that we constructed before, we can introduce a threshold value under criteria in the design view. Choose a threshold value that results in no more than 100 words by typing >[threshold value] (e.g. >3) in the ‘criteria’ row under count. Using this query, we can now construct a word-article table.
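The thresholding step and the resulting matrix of zeros and ones can be sketched in Python as follows. The word-article pairs and the threshold of 1 are hypothetical and chosen only to keep the example small.

```python
from collections import Counter

# Hypothetical word-article pairs, as produced by splitting texts into words.
pairs = [
    (1, "algae"), (1, "biomass"), (1, "growth"),
    (2, "algae"), (2, "seaweed"), (2, "biomass"),
    (3, "seaweed"), (3, "algae"),
]
threshold = 1  # keep words whose frequency is > threshold

counts = Counter(word for _, word in pairs)
kept_words = sorted(w for w, c in counts.items() if c > threshold)
articles = sorted({a for a, _ in pairs})

# Binary occurrence matrix: one row per article, one column per kept word.
present = set(pairs)
matrix = [[1 if (a, w) in present else 0 for w in kept_words] for a in articles]

print(kept_words)
for article, row in zip(articles, matrix):
    print(article, row)
```

The word ‘growth’ occurs only once and therefore falls below the threshold, so it gets no column in the matrix.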

 

This table can be exported as an Excel file (by right click > export > Excel). In Excel (or Access), a pivot table (under the insert tab) can be created of words over articles (as shown above). Use the ‘count’ function and not ‘sum’ to construct your occurrence matrix! You may want to copy and paste the values to be able to edit the pivot table. For example, the pivot table includes rows and columns for the ‘blank’ values and the ‘grand totals’ that you do not want to include in your analysis.

In SPSS you can compute the cosine matrix by going to Analyze > Correlate > Distances > Between variables > Similarities > Cosine. Make sure that all empty cells have a zero value, for example by using the replace option in Excel.

What we end up with is a term vector: a vector of terms and their frequencies. Although each vector has as many dimensions as there are words in the list, it can be treated like an ordinary two-dimensional vector, because the same rules apply in higher dimensions as in the two-dimensional case. If you plot two vectors, you can therefore tell how close they are by calculating the distance between their end points. To make the measurement even easier, we only look at the angle between them.

Here we can tell that d1 is more like q than d2 by noting the angles between the vectors. A shorthand for comparing the angles is to compare their cosines.

Cosine similarity
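The cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below applies this to the word columns of a small hypothetical occurrence matrix, which mirrors the ‘between variables’ choice in SPSS; treating all-zero vectors as having similarity 0 is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical occurrence matrix: rows are texts, columns are words.
occurrence = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
columns = list(zip(*occurrence))  # one vector per word
similarity = [[cosine(u, v) for v in columns] for u in columns]

for row in similarity:
    print([round(value, 2) for value in row])
```

The first two words occur in exactly the same texts, so their cosine is 1 (up to rounding), while the third word shares only one text with them and gets a lower value.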

The resulting matrix of cosine similarities can be represented in the DL format (data definition language) that social network analysts use, as shown below. Copy the output table from SPSS and paste it into Excel first, because copying from SPSS directly into a text file tends to be problematic.

Open Notepad and paste all information in the DL format shown below. First paste a list of labels (e.g. words) and then paste the matrix with only the cosine values, without the column headers and row labels. (Here (right click) is an example of such a file, and here a gif of such a file (to preserve the format)).

DL
N=[fill in the number of words]
FORMAT = FULLMATRIX DIAGONAL PRESENT
LABELS:
[label1]
[label2]

[label n]
DATA:
[your matrix]

For example:

DL
N=5
FORMAT = FULLMATRIX DIAGONAL PRESENT
LABELS:
GAVEV
KOTUN
OVE
ALIKA
NAGAM
DATA:
0 1 0 0 0
1 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 0
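If you prefer to generate such a file rather than paste it together by hand, a minimal Python sketch could look like this. The function name `to_dl` is hypothetical; the labels and matrix are those of the example above.

```python
def to_dl(labels, matrix):
    """Return DL text (full matrix, diagonal present) for a square matrix."""
    lines = ["DL", f"N={len(labels)}", "FORMAT = FULLMATRIX DIAGONAL PRESENT", "LABELS:"]
    lines.extend(labels)
    lines.append("DATA:")
    for row in matrix:
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

labels = ["GAVEV", "KOTUN", "OVE", "ALIKA", "NAGAM"]
matrix = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
dl_text = to_dl(labels, matrix)
print(dl_text)
```

The printed text has the same layout as the example file above and can be written to disk with an ordinary `open(..., "w")` call.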

Save this file as an ASCII text file and read it into Pajek. Draw the picture and give the lines different widths. You can remove the unconnected words by partitioning the file (in Pajek 2.x: Net > Partition > Core > All) and then choosing Operations > Extract from Network > Partition > 1-*. The unconnected words are in the partition with number zero and are thus removed. In this example the lines are drawn with different widths and size 2, fonts with size 10, and vertices with size 8, as defined in the input file. Note that one can edit the input file or edit the partitions under File > Partition > Edit.

After importing this file into Pajek, the results may be a bit disappointing because all the relations have values (the cosine runs from 0 to 1). In the main menu of Pajek 2, go to Net > Transform > Remove > Lines with values lower than 0.2. Redraw.
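The same thresholding can be sketched outside Pajek. The fragment below keeps only pairs with a cosine of at least 0.2 and writes them as an undirected edge list in the Pajek .net layout (a `*Vertices` section followed by an `*Edges` section). The function name, labels and cosine values are hypothetical.

```python
def to_pajek_net(labels, matrix, minimum=0.2):
    """Return Pajek .net text with one undirected edge per pair whose value is >= minimum."""
    lines = [f"*Vertices {len(labels)}"]
    for index, label in enumerate(labels, start=1):
        lines.append(f'{index} "{label}"')
    lines.append("*Edges")
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # upper triangle only; skips the diagonal
            if matrix[i][j] >= minimum:
                lines.append(f"{i + 1} {j + 1} {matrix[i][j]}")
    return "\n".join(lines) + "\n"

labels = ["algae", "biomass", "seaweed"]
cosines = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
net_text = to_pajek_net(labels, cosines)
print(net_text)
```

The pair with a cosine of 0.1 falls below the threshold and is dropped, so only two lines remain in the network.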

Update: To find the same functionality in different versions of Pajek please refer to Changes in Main Menu Structure from Pajek 2.05 to Pajek 3.** and Pajek 4.**

You can save the file in .net format and read it into Gephi. Please provide a nice visualisation. Can you provide an interpretation of the results?

We have now visualized the cosine similarities among words. Please note that there is no necessary relationship between co-occurrences in the observable network of relations and distances derived from co-occurrence patterns. The co-occurrence patterns can be mapped using the cosine values among the distributions, whereas the values of co-occurrence relations can be visualised directly. In the latter case one visualises the network of observable relations, whereas in the former one visualises the latent structure in the data. Two synonyms, for example, may have (statistically) similar positions in a semantic map, but they will rarely co-occur in a single title.

Additional empirical analysis

Can you provide another interesting word map based on original data?

The words-articles matrix that we made before (the pivot table) can also be used to visualise the relationships between words directly. Save the pivot table as an ASCII text file as shown below and read it into Pajek.

DL
NR=4, NC=4
FORMAT = FULLMATRIX DIAGONAL PRESENT
ROW LABELS:
Text1
Text2
Text3
Text4
COLUMN LABELS:
Word1
Word2
Word3
Word4
DATA:
34 0 0 4
0 520 21 35
0 25 100 7
2 74 30 432

Draw the picture and give the lines different widths. You will see that Pajek reads an asymmetrical (or 2-Mode) matrix like this one as a double matrix. Please explain why.

One can distinguish cited and citing in Pajek by using Net > Transform > 2-Mode to 1-Mode. You can save the 1-Mode network as .net file that can be read into Gephi.

References

Easley, David, and Jon Kleinberg. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press, 2010.

Hanneman, Robert A., and Mark Riddle. “Introduction to social network methods.” (2005).

Watts, Duncan J. “The ‘new’ science of networks.” Annual Review of Sociology (2004): 243-270.