MMI Lesson 1: Scientometrics

Scientometrics is the science of measuring and analysing science. Modern scientometrics is mostly based on the work of Derek J. de Solla Price and Eugene Garfield. The latter founded the Institute for Scientific Information (ISI), whose citation databases are still heavily used for scientometric analysis. Methods of research include qualitative, quantitative and computational approaches. Related fields are science and technology studies, innovation studies and the sociology of scientific knowledge. Journals in the field include Scientometrics and the Journal of the American Society for Information Science and Technology. Many publications in the field of innovation studies (e.g. in Research Policy, TASM, TFSC) are also based on scientometric analyses.

Please discuss two interesting innovation related articles using scientometric network analyses. Please specify what constitutes the nodes and the relationships in these analyses.

Paul Wouters provides further reading for those interested in the emergence of the Science Citation Index. Wouters describes in his dissertation (http://www.garfield.library.upenn.edu/wouters/wouters.pdf) how scientometrics (the measurement of science) emerged in the sixties with the invention of the Science Citation Index (SCI) in Philadelphia (United States). Using the SCI, it took far less work to extract citation frequencies from the data. It even became possible to measure the frequency with which an individual was cited, a feat previously unheard of. The emergence of the database had many unexpected consequences. The Science Citation Index is not merely a bibliographic instrument. It also creates a new picture of science via bibliographic references found in scientific literature. As the Terminology & Definitions section of the SCI explains: The Citation Index is an alphabetic list of references given in bibliographies and footnotes of source articles arranged by first author. Each reference is followed by brief descriptions (citations) of the source articles which cite it. In this way, the SCI provides a fundamentally new representation of science.

Systematic

In this section we examine ways to analyze the data we have collected in terms of systems of relations. Why do we need relations? Authors may be related to different titles and titles to different authors. Thus, networks of relations can be constructed. A common measure of such relations is the extent to which papers cite the same previous papers. This is called bibliographic coupling. Conversely, co-citation is the configuration in which two papers are cited together by the same later papers; it is based on being cited by, rather than citing, other papers. Later on in this course, we will look at social network analyses in more detail. Other types of relations are co-words (Courtial – Coword analysis of scientometrics) and word-reference co-occurrences (Van den Besselaar and Heimeriks 2006). For further reading see Boyack – Co-Citation Analysis, Bibliographic Coupling, and Direct Citation: Which Citation Approach Represents the Research Front Most Accurately?
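The difference between bibliographic coupling and co-citation can be made concrete with a small sketch. The paper and reference labels below are hypothetical toy data, and the function names are illustrative only; the sketch simply counts shared references per pair of citing papers (coupling) and joint appearances per pair of cited references (co-citation).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each citing paper maps to the set of references it cites.
citations = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3"},
    "P3": {"R3", "R4"},
}

def bibliographic_coupling(citations):
    """Count the shared references for every pair of citing papers."""
    strength = {}
    for a, b in combinations(sorted(citations), 2):
        shared = len(citations[a] & citations[b])
        if shared:
            strength[(a, b)] = shared
    return strength

def co_citation(citations):
    """Count how often every pair of references is cited by the same paper."""
    counts = defaultdict(int)
    for refs in citations.values():
        for a, b in combinations(sorted(refs), 2):
            counts[(a, b)] += 1
    return dict(counts)

coupling = bibliographic_coupling(citations)
cocited = co_citation(citations)
print(coupling)
print(cocited)
```

In the coupling result the nodes are citing papers; in the co-citation result the nodes are cited references. In both cases the count serves as the weight of the relationship.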

Please discuss these different types of relational analysis. The discussion should consist of 6 slides.

In order to perform analyses, we need to organize our data in a relational database. A relational database matches data by using common characteristics found within the data set. The resulting groups of data are organized and are much easier for many people to understand. For example, a data set containing all publications in a field can be grouped by the year each publication was published, the country of origin, the topics, the cited references, journal names, author’s last name and so on. Such a grouping uses the relational model (a technical term for this is schema). Hence, such a database is called a “relational database.”

There are several good tutorials for MS Access 2010. In this course we will focus mostly on constructing queries in MS Access. A “query” is an instruction to the database to return some (or all) of the data it holds. In other words, you are “querying” the database for data that match given criteria.

At the bottom right side of the screen, you will have noted the option “Output Records”. Here you can save records, at a maximum of 500 at a time, for further processing. The ISI freeware programs available at http://www.leydesdorff.net/software/isi/index.htm allow the user to organize these files for “relational database management.”

Alternatively, the Rathenau Institute has developed SAINT, which stands for Science Assessment Integrated Network Toolkit. This is a set of tools for bibliometric and patentometric research, including a parser program for bibliographic data downloaded from the ISI/Web of Science. Currently, Microsoft Access files are the only supported database backend.

SAINT can be used for:

1. Turning raw data into a relational database
2. Cutting titles and abstracts into separate words for analysis
3. Building queries to answer your questions
  • from simple statistics
  • to complex patterns
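To illustrate what cutting titles into separate words means, here is a minimal sketch. It is not SAINT's own procedure (which is not described here); the titles are hypothetical and the tokenisation rule (lower-case, keep runs of letters and digits) is an assumption chosen for simplicity.

```python
import re
from collections import Counter

def title_words(title):
    """Lower-case a title and split it into alphanumeric words.

    Note: hyphenated terms such as "co-word" are split into two words.
    """
    return re.findall(r"[a-z0-9]+", title.lower())

# Hypothetical example titles.
titles = [
    "Mapping the dynamics of innovation systems",
    "Innovation systems and network analysis",
]

# Count how often each word occurs across all titles.
word_counts = Counter(word for t in titles for word in title_words(t))
print(word_counts.most_common(3))
```

Word lists of this kind are the raw material for co-word analysis, where two words are related if they occur in the same title or abstract.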

Run SAINT installation, select ISI data importer
Select your input .txt files
Select output database (use MS Access initially; MySQL databases for file sizes over 2 GB)
Run
Examine database in Access

 

 

The ISI data importer tool displays three tab pages. On the first page, you can select the input file or files that you want to import. Type or copy/paste the name of the file into the box, or click on the file selector button to display a file dialog. From the file dialog, you can easily select multiple files (located in a single directory) at once. The last ten selected files will be stored, so you can easily access them again using the drop-down list. Click on the little arrow inside the box for the file name to display the previously used files.

Once you have selected the files that contain the raw ISI data, change to the Output tab. Here, another file selection box is presented. Use this Output file box to select the file to which you want to write your data. Note that this file does not have to exist yet: if you enter or select a file that does not exist, it will be created.

 

Queries are used to interact with the data in a database:

  • search and select (e.g. all articles where country is Netherlands)
  • combine and compare (merge two tables; calculate similarity)
  • statistical analysis (count, sum, average, etc.)

Please briefly discuss some standard queries (what information do they provide?):

  • Count of Articles per year and per country
  • Bibliographic coupling
  • Co-authorship
  • A query that couples articles to authors only
  • A query that couples articles to research addresses only
  • A query that couples authors to keywords
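As an illustration of the first standard query in the list above, the sketch below counts articles per year and per country. It uses Python's built-in sqlite3 module as a stand-in for Access, and the table and column names (articles, addresses, article_id, country) are hypothetical; the tables generated by the ISI parser may be named differently.

```python
import sqlite3

# In-memory database with hypothetical tables and toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE addresses (article_id INTEGER, country TEXT);
INSERT INTO articles VALUES (1, 'Paper A', 2009), (2, 'Paper B', 2009), (3, 'Paper C', 2010);
INSERT INTO addresses VALUES (1, 'Netherlands'), (2, 'Netherlands'), (2, 'Germany'), (3, 'Netherlands');
""")

# Join articles to their research addresses on the numeric identifier,
# then count distinct articles for every (year, country) combination.
rows = conn.execute("""
SELECT a.year, ad.country, COUNT(DISTINCT a.id) AS n_articles
FROM articles AS a
INNER JOIN addresses AS ad ON ad.article_id = a.id
GROUP BY a.year, ad.country
ORDER BY a.year, ad.country
""").fetchall()

for year, country, n in rows:
    print(year, country, n)
```

Counting distinct article identifiers avoids counting a paper twice when it lists several addresses in the same country. An internationally co-authored paper is still counted once for each country involved.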

When you build queries, always link tables using unique identifiers.

Use numbers as identifiers:
  • matching on numbers is MUCH faster than matching on strings
  • use the type long integer when you make your own

SAINT produces unique identifiers for all items.
When you make your own data (e.g. harmonised author names), make sure to add a unique identifier.
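A minimal sketch of adding numeric identifiers to self-made data follows. The harmonisation rule used here (lower-casing and dropping punctuation) is a deliberately crude assumption for illustration, and the function names are hypothetical; real author-name harmonisation usually needs manual checking.

```python
def harmonise(name):
    """Crude normalisation: lower-case, drop commas/periods, collapse spaces."""
    cleaned = name.replace(".", " ").replace(",", " ").lower()
    return " ".join(cleaned.split())

def assign_ids(names):
    """Give every distinct harmonised name a sequential integer identifier."""
    ids = {}
    for name in names:
        key = harmonise(name)
        if key not in ids:
            ids[key] = len(ids) + 1
    return ids

# Hypothetical raw name variants as they might appear in downloaded records.
raw_names = ["Hekkert, M.", "hekkert m", "Heimeriks, G."]
author_ids = assign_ids(raw_names)
print(author_ids)
```

The resulting integers can be stored in a long integer column and used to link the harmonised names to the other tables.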

Scientific publications can be harvested from the Web of Science of the Institute for Scientific Information (Thomson) at http://www.isiknowledge.com/. You can access this address directly or through the digital library of Utrecht University.

Go to “advanced search”. Let’s search for some authors in the Innovation Studies group at Utrecht University (see http://www.geo.uu.nl/phpscripts/staffpages/index.php?group=33). Select some of the authors (among the Dr.’s) and use the following type of search: “AU= hekkert m* OR heimeriks g* or van lente h* OR alkemade f* OR Farla J* OR Herrmann A* or Negro s* OR Peine A* OR Van Rijnsoever F*”. You may wish to include “AND address =Univ Utrecht”. If you click on the retrieved records, you can inspect them one by one. They are listed with the most recent papers on top. Scroll down and find one with citations. Click on it and study the layout of the record. As you see, you can click on the cited and citing references. What is the difference between these two?

Go back to the listing. On the right side is a screen that enables you to “Analyze Results” and to make a “Citation Report”. The citation report, for example, informs you about the development over time. The picture raises questions. Can you formulate one? The tab “Analyze Results” allows you to generate distributions. Make a distribution of the authors in this set.

In the next step we will download the records in order to proceed with more options for the scientometric analysis. To that end, enter the total number of records (1 to 100+) in the third option under “Output Records”. Then click on “Add to Marked List”.

Enter the “Marked List” at the top of the screen and save the records to file after tagging all the fields that may be of interest to us at a later stage. (Take them all.) The computer now saves the full records (as plain text!) and thereafter you can save them in a folder as “data.txt”.

The text file can be parsed with the ISI-parser. An Access database will be created, with all ISI information organised in a relational database. Several standard queries are automatically generated. I used this text file (right click).
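To give an idea of what parsing involves, here is a minimal sketch of a reader for the plain-text export. It is not the ISI-parser itself. It assumes the usual field-tagged layout of Web of Science plain-text files (a two-letter tag such as AU, TI or PY at the start of a line, continuation lines indented with spaces, and ER closing each record); the sample records are invented.

```python
def parse_isi(text):
    """Split a field-tagged export into a list of {tag: [values]} records."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        head, value = line[:2], line[3:].strip()
        if head == "ER":  # end of record
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):  # file header and footer lines
            continue
        elif head == "  ":  # continuation of the previous field
            if tag is not None:
                current[tag].append(value)
        else:  # a new field tag
            tag = head
            current.setdefault(tag, []).append(value)
    return records

# Invented sample in the assumed layout.
sample = "\n".join([
    "FN ISI Export Format",
    "VR 1.0",
    "PT J",
    "AU Smith, J",
    "   Jones, A",
    "TI A hypothetical title",
    "PY 2010",
    "ER",
    "",
    "PT J",
    "AU Doe, J",
    "TI Another hypothetical title",
    "PY 2011",
    "ER",
    "",
    "EF",
])

records = parse_isi(sample)
print(len(records), records[0]["AU"])
```

Each field is kept as a list because fields such as AU (authors) and CR (cited references) hold one value per line. These multi-valued fields are what the relational database later splits into separate tables.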

Please generate lists of:

  • Most popular journals in which IS scholars publish
  • Most frequently cited references
  • Most frequently cited journals
  • Most frequently used keywords
Can you provide an interpretation of these results?

In order to get these results you have to construct queries. A query can be created in ACCESS by clicking the Create tab. Choose the option QUERY design.

You are asked which tables contain the information that you would like to combine in your QUERY. Highlight the tables that have the relevant data, then click Add. The tables will appear in a grey window; from there, click and drag the fields from each table that you want to appear in the result.

Each article in the ARTICLES table refers to its journal by an identifier; the journal names are listed in the JOURNALS table. You have to create a relationship between the two tables by clicking on Journals-ID in ARTICLES and dragging it to ID in JOURNALS.

First select the field ‘Journal-ID’ from the ARTICLES table, then select ID. In order to make a count, press the Sigma (Totals) button on the top menu bar. In the Total row that appears, change ‘Group By’ to ‘Count’ in the ID column.
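The same count can be written as SQL. The sketch below again uses sqlite3 as a stand-in for Access; the tables mirror ARTICLES and JOURNALS, but the column names (journal_id, name) and the rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE journals (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, journal_id INTEGER);
INSERT INTO journals VALUES (1, 'Research Policy'), (2, 'Scientometrics');
INSERT INTO articles VALUES (1, 1), (2, 1), (3, 2);
""")

# Link articles to journals on the numeric identifier, group by journal,
# and count the articles in each group (the "Count" in the Total row).
top_journals = conn.execute("""
SELECT j.name, COUNT(a.id) AS n_articles
FROM articles AS a
INNER JOIN journals AS j ON a.journal_id = j.id
GROUP BY j.id, j.name
ORDER BY n_articles DESC, j.name
""").fetchall()

for name, n in top_journals:
    print(name, n)
```

Sorting on the count in descending order puts the most popular journals at the top of the list.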

A similar QUERY can be constructed using the tables COUPLE-ARTICLES-CITEDREFERENCES (field CITED REFERENCES) and CITEDREFERENCES (count of ID).

You need three tables in your QUERY to get the most frequently cited journals;

The database also allows us to construct a co-author network. Gephi requires two input files (in csv or text format), preferably semicolon-delimited csv files:
  • one for nodes: something-nodes.csv (a list of all unique nodes in the network with an ID)
  • one for edges: something-edges.csv (a list of all relationships between the nodes)

 

How to import these SQL queries into your database: Create -> Query Design -> Close -> View SQL (top left button) -> Copy-paste the SQL into the screen -> Save and Run

 

Co-authorship Edges

Query: Co-authorships
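For readers who want to see how a co-authorship edge list of this shape can be derived, here is a sketch. It is not the definition of the Co-authorships query in the database (which is not reproduced here); it assumes a hypothetical coupling table of article and author identifiers and uses sqlite3 as a stand-in for Access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE couple_articles_authors (article_id INTEGER, author_id INTEGER);
INSERT INTO couple_articles_authors VALUES
  (1, 10), (1, 11),
  (2, 10), (2, 11), (2, 12),
  (3, 12);
""")

# Self-join: pair every two authors that share an article.
# The condition c1.author_id < c2.author_id keeps each undirected pair once
# and excludes pairing an author with themselves.
edges = conn.execute("""
SELECT c1.author_id AS authors_id_1,
       c2.author_id AS authors_id_2,
       COUNT(*) AS n_papers
FROM couple_articles_authors AS c1
INNER JOIN couple_articles_authors AS c2
  ON c1.article_id = c2.article_id
 AND c1.author_id < c2.author_id
GROUP BY c1.author_id, c2.author_id
ORDER BY c1.author_id, c2.author_id
""").fetchall()

for a, b, n in edges:
    print(a, b, n)
```

Here the nodes are authors and an edge is a shared paper; the count of co-authored papers becomes the edge weight.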

 

Co-Authorship Node IDs (save this query as “Co-authorships-Node IDs”, the name used by the next query)

SELECT [AUTHORS_ID_1] FROM [Co-authorships] UNION SELECT [AUTHORS_ID_2] FROM [Co-authorships];

Co-Authorship Nodes (save this query as “Co-authorships-Nodes”, the name used by the next query)

SELECT [Co-authorships-Node IDs].AUTHORS_ID_1, Authors.[full name] FROM [Co-authorships-Node IDs] INNER JOIN Authors ON [Co-authorships-Node IDs].AUTHORS_ID_1 = Authors.ID;

Co-Authorship Nodes to csv

SELECT [Co-authorships-Nodes].AUTHORS_ID_1 AS ID, [Co-authorships-Nodes].[full name] AS Label FROM [Co-authorships-Nodes];

Right-click on the query -> Export to text file -> Select folder and filename -> Select Delimited -> Delimiter: Semicolon and Text Qualifier: <none> -> Finish

Co-Authorship Edges to csv

SELECT [Co-authorships].Authors_ID_1 AS Source, [Co-authorships].Authors_ID_2 AS Target, "undirected" AS Type, [Co-authorships].[Number of co-authored papers] AS Weight
FROM [Co-authorships];

 

Right-click on the query -> Export to text file -> Select folder and filename -> Select Delimited -> Delimiter: Semicolon and Text Qualifier: <none> -> Finish

 

You can export all ACCESS tables and QUERIES as CSV, Excel or TXT files by right-clicking on them.

Nodes
ID;Label;AttributeA;AttributeB
1;John Smith;4;1.5

Edges
Source;Target;Type;Weight;AttributeA;AttributeB
1;4;undirected;4;1.5;3
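If the node and edge lists are prepared outside Access, files in the layout shown above can be written with a few lines of code. This is a sketch with invented data; it writes to in-memory buffers, and in practice you would open files such as something-nodes.csv and something-edges.csv instead.

```python
import csv
import io

# Invented example data: (ID, Label) for nodes, (Source, Target, Weight) for edges.
nodes = [(1, "John Smith"), (4, "Jane Doe")]
edges = [(1, 4, 4)]

def write_nodes(handle, nodes):
    """Write a semicolon-delimited node list with an ID;Label header."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(["ID", "Label"])
    writer.writerows(nodes)

def write_edges(handle, edges):
    """Write a semicolon-delimited, undirected, weighted edge list."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for source, target, weight in edges:
        writer.writerow([source, target, "undirected", weight])

nodes_out, edges_out = io.StringIO(), io.StringIO()
write_nodes(nodes_out, nodes)
write_edges(edges_out, edges)
print(nodes_out.getvalue())
print(edges_out.getvalue())
```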

  • Go to the tab Data Laboratory
  • Select Window > Context; this will allow you to see how many nodes and edges have been imported
  • Gephi still has trouble importing large numbers of edges: ALWAYS CHECK!
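To know what numbers to expect in the Context window, you can count the nodes and edges in your csv files beforehand. The sketch below does this for semicolon-delimited files with the ID and Source/Target headers used in this lesson; the function name and the sample content are hypothetical.

```python
import csv
import io

def check_edge_list(nodes_file, edges_file):
    """Return the node count, edge count, and edges pointing to unknown nodes."""
    node_ids = {row["ID"] for row in csv.DictReader(nodes_file, delimiter=";")}
    edge_rows = list(csv.DictReader(edges_file, delimiter=";"))
    dangling = [
        (e["Source"], e["Target"])
        for e in edge_rows
        if e["Source"] not in node_ids or e["Target"] not in node_ids
    ]
    return len(node_ids), len(edge_rows), dangling

# Invented sample content; in practice pass open file handles instead.
nodes_csv = io.StringIO("ID;Label\n1;John Smith\n4;Jane Doe\n")
edges_csv = io.StringIO(
    "Source;Target;Type;Weight\n1;4;undirected;4\n1;9;undirected;1\n"
)

n_nodes, n_edges, dangling = check_edge_list(nodes_csv, edges_csv)
print(n_nodes, n_edges, dangling)
```

If the counts shown in Gephi are lower than these, some rows were not imported. Edges listed as dangling refer to node IDs that are missing from the nodes file.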

Can you provide a visualisation of the co-author network (OPTIONAL)? Please provide an interpretation of the network structure.

Use the ‘Import CSV’ button in Data Laboratory to import the Nodes file (first!) and then the Edges file (second).