5. Evolutionary Models of Science and Innovation

Evolutionary models in science and innovation have focused mostly on changes in technology and routines. If change occurs constantly in the economy, then some kind of evolutionary process must be at work, and it has been proposed that this process is Darwinian in nature. If so, the mechanisms that provide selection, generate variation, and establish self-replication must be identified. Variation may be pre-structured by selection, but one can nevertheless expect variation to change more rapidly than the selective structures. Selection is deterministic (determined by the structure of the selecting system), while variation introduces randomness (exploration).

Core concepts: product space

The Economist writes: “ECONOMIST Ricardo Hausmann, of Harvard University, and César Hidalgo, of the Massachusetts Institute of Technology, have just released the absorbing (and very visually appealing) Atlas of Economic Complexity. It builds on their earlier work which we wrote about here. Mr Hidalgo is a physicist who applies his knowledge of networks to economics.”

The fundamental proposition of the book is that the wealth of nations is driven by productive knowledge. Individuals are limited in the things they can effectively know and use in production so the only way a society can hold more knowledge is by distributing different chunks of knowledge to different people. To use the knowledge, these chunks need to be re-aggregated by connecting people through organizations and markets. The complex web of products and markets is the other side of the coin of the accumulating productive knowledge.

They find that the measure of productive knowledge or capabilities they infer from the product space, which they call the Index of Economic Complexity, is highly predictive of growth.

In fact, it beats measures of competitiveness such as the World Economic Forum’s Global Competitiveness Index by a factor of 10 in predicting growth for the following decade. It also beats by similar margins measures of human capital and governance.

Their model shows how the products a country makes today determine which products it will be able and likely to make tomorrow, through the evolution of its capabilities.

This is especially interesting from an investment point of view. The authors write in another paper:

Traditionally, economic development has been measured through a host of aggregated variables, mainly GDP, adjusted for PPP. Yet, as a concept, development had always been associated with an increase in diversity that cannot be captured by such averages. As the human body develops, cells differentiate into neurons, muscles, bones and other cell types. Similarly, as nations develop, different industries and products are born. Assessing the health of a nation solely based on its wealth is like assessing the health of a child solely based on its weight.

  • Design a model of the innovative performance of a country or a firm based on the product space model of Hausmann and Hidalgo. Justify your choices. You can either analyse a single country or do a country comparison.
  • Design indicators of innovative performance according to your model. Justify your choices.
  • Collect data and provide an interpretation. What do the outcomes of your model mean?

Additional literature: evolutionary concepts in social sciences

Of paramount importance to the natural sciences, the principles of Darwinism, which involve variation, inheritance, and selection, are increasingly of interest to social scientists as well.

In Darwin’s Conjecture, Geoffrey Hodgson and Thorbjørn Knudsen reveal how the British naturalist’s core concepts apply to a wide range of phenomena, including business practices, legal systems, technology, and even science itself. They also critique some prominent objections to applying Darwin to social science, arguing that ultimately Darwinism functions as a general theoretical framework for stimulating further inquiry. Social scientists who adopt a Darwinian approach, they contend, can then use it to frame and help develop new explanatory theories and predictive models.

What insights can you offer on the difficulties and opportunities of using evolutionary models of variation, inheritance, and selection in the social sciences in general and the innovation sciences in particular?

Core empirical analysis: Technology space

Technology classes inform us about the technologies a patent relies on. As a patent can have anywhere between 1 and 80 technology classes assigned, it combines those technologies in a single invention. Conceptually, the frequency with which two particular classes occur together on a patent can thus indicate the relatedness of the two technologies. In other words, using patent data it is possible to calculate the relative distance between any pair of technology classes.

In BigQuery, we can easily query for the number of co-occurrences.

SELECT t1.ipc_subclass_symbol AS class_1, t2.ipc_subclass_symbol AS class_2, count(distinct t1.appln_id) AS co_occur

FROM [innometrics-1055:patentdata.priorities] t0

INNER JOIN EACH [innometrics-1055:patentdata.tls209_appln_ipc] t1 on t0.appln_id = t1.appln_id

left JOIN EACH [innometrics-1055:patentdata.tls209_appln_ipc] t2 on t1.appln_id = t2.appln_id

WHERE t1.ipc_subclass_symbol != t2.ipc_subclass_symbol

GROUP BY class_1, class_2

We ask the database for all classes in table t1 (where the appln_id also occurs in table t0 (priorities), to prevent duplicates). Next, for each class in table t1, we take the appln_ids coupled to that class and ask which IPC classes are associated with those appln_ids in table t2.

(Table t2 is the same table as table t1. To join a table to itself in BigQuery, the two copies need distinct names, so we give the table two aliases: t1 and t2.)

If we did not use the count(…) function and the GROUP BY command, the results would show all individual matches, which is clearly too much. By asking for COUNT(DISTINCT appln_id), BigQuery counts every unique patent application that has both class_1 and class_2 associated with it, which yields a co-occurrence table.
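
The counting logic of the query above can be sketched in Python on a toy example. The application ids and IPC subclasses below are hypothetical and serve only as illustration; the function name is likewise an assumption and not part of the tutorial scripts.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: application id -> IPC subclasses assigned to it.
applications = {
    1: {"B60K", "F02D"},
    2: {"B60K", "F02D", "G01M"},
    3: {"G01M"},
}


def count_co_occurrences(apps):
    """Count, per ordered pair of distinct classes, the applications listing both."""
    counts = Counter()
    for classes in apps.values():
        # Ordered pairs mirror the self-join: (A, B) and (B, A) both appear.
        # Each application holds a set of classes, so it is counted once per pair,
        # which corresponds to COUNT(DISTINCT appln_id).
        for class_1, class_2 in permutations(sorted(classes), 2):
            counts[(class_1, class_2)] += 1
    return counts


co_occur = count_co_occurrences(applications)
print(co_occur[("B60K", "F02D")])  # 2
```

As in the query, a class is never paired with itself, and every pair appears in both orders.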

This data can be used to calculate a relative technological distance (or proximity) for each pair of classes. Different measures can be used to calculate such a distance, e.g. Salton’s cosine or the Jaccard index. The Jaccard index quantifies the technological relatedness as follows:

$$J_{LS} = \frac{C_{LS}}{N_L + N_S - C_{LS}}$$

where $J_{LS}$ is the technological relatedness between technologies L and S, $C_{LS}$ denotes the number of times technologies L and S occur together in a patent, and $N_L$ and $N_S$ denote the number of patents in which the respective technologies occur. This measure has proven appropriate in similar analyses by many others (Boschma, Heimeriks, & Balland, 2014; Gerken & Moehrle, 2012; Thoma, Torrisi, Gambardella, & Guellec, 2010; Thoma & Torrisi, 2007).
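
As a minimal sketch of the Jaccard index described above, the following Python snippet computes the relatedness of one pair of classes from its co-occurrence count and the two individual occurrence counts. The counts are hypothetical, and the function and variable names are assumptions made for this illustration.

```python
def jaccard(co_occur, occur_l, occur_s):
    """Jaccard index: patents listing both classes / patents listing either class."""
    union = occur_l + occur_s - co_occur
    return co_occur / union if union else 0.0


# Hypothetical counts for two IPC subclasses.
occur = {"B60K": 40, "F02D": 25}
co_occur = {("B60K", "F02D"): 15}

relatedness = {
    pair: jaccard(count, occur[pair[0]], occur[pair[1]])
    for pair, count in co_occur.items()
}
print(relatedness[("B60K", "F02D")])  # 0.3
```

The value runs from 0 (the classes never occur together) to 1 (the classes always occur together).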

So in addition to the co-occurrence, this measure needs individual occurrence values as well. A very simple query will give us the desired data:

SELECT t1.ipc_subclass_symbol AS class, count(distinct t1.appln_id) AS occur

FROM [innometrics-1055:patentdata.priorities] t0

INNER JOIN EACH [innometrics-1055:patentdata.tls209_appln_ipc] t1 on t0.appln_id = t1.appln_id

GROUP BY class

ORDER BY class


Again, the appln_ipc table is linked to the priorities table to only use data from priority patents.

The Jaccard index can be calculated in R. However, as you might have discovered, the first query returns a table of approximately 250,000 rows, which is more than the CSV export allows.

There are a few ways to extract the data. It is possible to connect R to the database using the package ‘bigrquery’; this, however, requires having the data in your own Google account, which involves setting up a billing account etc. It is also possible to split the results into parts of at most 16,000 rows each by adding a WHERE or HAVING clause that limits the number of results (e.g. [HAVING class_1 LIKE "A%"] to select only results where class_1 starts with A).

This is all a lot of trouble so we have exported the file for you. (it is in google drive, in ‘Datafiles’, named ‘ipc_co_occurrence.csv’) https://drive.google.com/folderview?id=0B3YXebfRfhx3fnRLNTA0d1ZqeDNpWkhkclNtR2lVNlN6MEFzOU1DSENTRUs4VTl6a3NJT2c&usp=sharing

This file can be imported in RStudio as done before (make sure to check ‘Yes’ at ‘Heading’ when importing the file).

Also export the results of the last query and import in RStudio. Be sure to name the dataframes ‘ipc_co_occurrence’ and ‘ipc_occur’ respectively when importing in R, so they will be compatible with the scripts.

Open the file ‘Tutorial_technology_map.R’ in RStudio. This file contains scripts and instructions to generate a Gephi network file from technology co-occurrence data.

Two notes about importing this network file into Gephi:

  • It is advisable to apply a filter to the edge-weight in order to make the graph less ‘overconnected’.
  • You can copy the node ‘names’ to the node ‘labels’ column in gephi to show the ipc classes in the network.

Create a nice technology map in which you identify some clusters.

Now remember the IPC technology portfolio of firms. If we rewrite the query a little, we get this:

SELECT t1.ipc_subclass_symbol AS Name, count(distinct t1.appln_id) AS occur_DAF

FROM [innometrics-1055:patentdata.201502_HAN_PATENTS] t0

INNER JOIN EACH [innometrics-1055:patentdata.tls209_appln_ipc] t1 on t0.appln_id = t1.appln_id

WHERE HAN_id = 540673

GROUP BY Name


If you download this table and import it into Gephi as a node table, you will see that it adds a column ‘occur_DAF’ to the data table. The newly imported data is at the bottom of the nodes table.

If you click ‘More actions’ > ‘Detect and merge duplicates’, you can merge the new data by setting:

  • base column to detect duplicates: Name
  • for column ‘occur_DAF’, select ‘Join values with a separator’.

Now you can use this new attribute to color nodes by partition ‘occur_DAF’ to visualise DAF’s technology portfolio on the technology map.

If you want to do this for a certain time frame, you can add an appln_filing_year filter:

SELECT t1.ipc_subclass_symbol AS Name, count(distinct t1.appln_id) AS occur_DAF

FROM [innometrics-1055:patentdata.201502_HAN_PATENTS] t0

INNER JOIN EACH [innometrics-1055:patentdata.tls209_appln_ipc] t1 on t0.appln_id = t1.appln_id

INNER JOIN EACH [innometrics-1055:patentdata.tls201_appln] t3 ON t1.Appln_id = t3.appln_id

WHERE HAN_id = 540673 AND t3.appln_filing_year BETWEEN 1990 AND 2010

GROUP BY Name


Please take the ipc technology portfolio of a company (preferably a larger company than DAF) for the years 1990-2000. Create the overlay in Gephi to determine its position. Based on this information, you may predict which technology classes are likely to be added to the technology portfolio of a firm based on the model of technology space.
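
One possible way to turn the technology map into a prediction is sketched below in Python: classes outside the portfolio are ranked by their mean relatedness to the classes already in it. This scoring rule is an assumption chosen for illustration, not a method prescribed by this tutorial, and all class names, relatedness values, and function names are hypothetical.

```python
def rank_candidates(portfolio, relatedness, all_classes):
    """Rank classes outside the portfolio by mean relatedness to classes inside it."""
    if not portfolio:
        return []
    scores = {}
    for candidate in all_classes - portfolio:
        # Pairs without a recorded relatedness value are treated as unrelated (0.0).
        values = [
            relatedness.get(frozenset((candidate, owned)), 0.0)
            for owned in portfolio
        ]
        scores[candidate] = sum(values) / len(values)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical relatedness values (e.g. Jaccard indices) between IPC subclasses.
relatedness = {
    frozenset(("B60K", "F02D")): 0.30,
    frozenset(("B60K", "B62D")): 0.20,
    frozenset(("F02D", "B62D")): 0.10,
    frozenset(("B60K", "A61K")): 0.01,
}
all_classes = {"B60K", "F02D", "B62D", "A61K"}
portfolio = {"B60K", "F02D"}  # classes the firm held in the first period

ranking = rank_candidates(portfolio, relatedness, all_classes)
print(ranking[0][0])  # B62D
```

The classes at the top of the ranking are the candidates to compare with the firm's actual portfolio in the later period.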

Now take the ipc technology portfolio of the same company for 2001 – 2012 and discuss whether your predictions were right.

While we can select the portfolio for a firm, we can also do this for a country. Because countries are likely to have many classes associated, we can set a threshold.

The query below assigns the value ‘many’ to all IPC classes that occur more than 99 times (between 1990 and 2010), and the value ‘few’ to all IPC classes occurring 1-99 times.

SELECT t1.ipc_subclass_symbol AS Name,

CASE

WHEN count(distinct t1.appln_id) > 99 THEN 'many'

ELSE 'few' END

AS occur_NL

FROM [innometrics-1055:patentdata.201502_HAN_PATENTS] t0

INNER JOIN EACH [innometrics-1055:patentdata.tls209_appln_ipc] t1 on t0.appln_id = t1.appln_id

INNER JOIN EACH [innometrics-1055:patentdata.tls201_appln] t3 ON t1.Appln_id = t3.appln_id

INNER JOIN EACH [innometrics-1055:patentdata.201202_HAN_PERSON] t4 ON t0.HAN_ID = t4.HAN_ID

WHERE t4.Person_ctry_code LIKE 'NL' AND t3.appln_filing_year BETWEEN 1990 AND 2010

GROUP BY Name
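
The thresholding step can also be applied after export, for instance to try a different cut-off without rerunning the query. The Python sketch below assumes a hypothetical table of occurrence counts per IPC subclass; the counts and the function name are illustrative only.

```python
def label_by_threshold(counts, threshold=99):
    """Label a class 'many' if it occurs more than `threshold` times, else 'few'."""
    return {
        name: ("many" if count > threshold else "few")
        for name, count in counts.items()
    }


# Hypothetical occurrence counts per IPC subclass.
counts = {"H01L": 1520, "A01B": 99, "C07D": 100}

labels = label_by_threshold(counts)
print(labels)  # {'H01L': 'many', 'A01B': 'few', 'C07D': 'many'}
```

Note the boundary: a class with exactly 99 occurrences is still labelled ‘few’.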


OPTIONAL Additional empirical analysis: Scientific Journals and Journal Structures.

Let us turn to the website of the JCRs. It is not so easy to find the journals of science, technology, and innovation studies using the subject categories provided. Try to find some of the journals which you know; for example, under the heading “History and Philosophy of Science.” Research Policy is to be found under “Management” and Scientometrics is part of the Library and Information Science literature. Thus, let’s take the other approach of a specific journal, for example, Research Policy. Type the journal name and a new screen is provided which informs us about standard journal indicators like the impact factor, etc.

Click on the journal name. In the new screen, the ISI provides a wealth of information about the journal. Among other things, the different indicators are defined. Try to understand the impact factor or turn to Wikipedia for more information if you fail to grasp the definition from the formulas (at http://en.wikipedia.org/wiki/Impact_factor ).

OK, let’s move on. Click on the link for the “Cited journal data.” The cited journal tables and citing journal tables, respectively, provide all the basic information which one needs for constructing a citation matrix among journals included in the ISI databases. For example, you can see that Research Policy is cited in total 74 times by articles in Scientometrics and 30 times by articles in Social Studies of Science. In total, Research Policy is cited 2,470 times during 2005. As you can find by proceeding to the next pages, these citations are provided by a large number of journals (165), but only twenty or thirty of these journals contribute substantively to the citation pattern.

Network analysis

Journal citations occur in dense clusters of journals which cover specialties, but most journals also have long tails in their distributions. Thus, Research Policy is cited by articles in Research-Technology Management only twice, and vice versa this latter journal is cited four times by articles in Research Policy. A citation matrix can be constructed by feeding these numbers into a table or an Excel sheet as follows:



(rows: cited journal; columns: citing journal)

                        Research Policy   Res-Techn. Management   Scientometrics
Research Policy                     432                       2               74
Res-Techn. Management                 4                      34                0
Scientometrics                       35                       0              520

The zeros are not real zeros, but missing values. All values lower than two are lumped together under the category “All others”. Can you add Social Studies of Science to this table?

The resulting file can be represented in the DL-format (data definition language) which social network analysts use as follows:

NR=4, NC=4
34     0     0     4
0   520    21    35
0    25   100     7
2    74    30   432

Table 1: Citation matrix in DL format.

Save this file as an ASCII text file and read this into Pajek. Draw the picture. Set the lines at different width. You will see that Pajek reads an asymmetrical (or 2-Mode) matrix like this one as a double matrix. The positions of the cited and citing journals are different and strongly connected because of the high values on the main diagonal.

One can distinguish cited and citing in Pajek by using Net > Partition > 2-Mode (or in Pajek64: Network > 2-Mode network > 2-Mode to 1-Mode). Now two partitions of four journals each are created. If you draw the partitions now, you should see different colours for cited and citing positions. You can change these colours by going to Options > Colors > Partition Colors. If you go back to the main menu, go to Partition > Make Cluster > “1”, redraw the partition and choose Options > Mark Vertices Using > Mark Cluster Only, you will be able to visualize one set of patterns exclusively. Which patterns do we obtain: the cited or the citing ones?

One can also extract partition 1 in the main menu as follows: Operations > Extract from Network > Partition 1. Why does the resulting picture show no relations?

Go back to the original 2-mode network (nr 1.) with 8 nodes. Choose Net > Transform > 2-Mode to 1-Mode. If you choose now Rows, you get the cited patterns; if you choose Columns the citing ones. How are they different or similar?

From Pajek to SPSS and vice versa

In previous lessons we have used the cosine-matrices instead of the raw scores. We will now go from the citation matrix to the cosine matrix using SPSS. In Table 2, the resulting cosine matrix is provided in the format of Pajek itself. But let’s do the exercise in order to understand the relations.

Return to the two-mode matrix with 8 nodes in Pajek (before the further processing). Click on Tools > SPSS > Current Network. Pajek opens a window which reports to you where you can find the file which you need for importing the data into SPSS. Find that file and double click on it. If it works, SPSS opens several windows. (If not, open SPSS, open the file as a syntax file, and Run > All.) In the matrix (in another window), you should find the same information as above, but now within SPSS. Inspect the matrix both in the variable and the data view. Try to understand it fully. The variable view describes the variables and, among other things, labels them.

SPSS computes almost exclusively in terms of variables, that is, columns of the matrix. (We shall see an exception to the rule below.)  Cases (rows) can be selected, grouped, and clustered, but are not the subject of analysis. Social Network analysis (Pajek) tends to “think” in terms of the rows. In the analysis above (using Pajek), for example, the rows were the first partition, and the columns the second. The rows represent the nodes and the columns the links attributed to them. Attributes can be variables and thus SPSS “thinks” the other way around. However, you can tumble the matrix (“transpose” it) in both programs.

Let’s make the cosine-matrix. Click on Analyze > Correlate > Distances. Bring the four variables (col1 to col4) to the right side. Compute distances between variables, using Similarities and Choose the cosine instead of the Pearson correlation. The Proximity matrix which is created, contains the cosine values. You can right-click on it and export it, for example, as an Excel file. I pasted the values of this matrix into Table 2 and added the headings in the so-called Pajek-format. This matrix is symmetrical and thus we need the labels only once. Try to replace my values with yours, take the file and import it into Pajek by saving it first as an ASCII Plain text file (DOS with CR/LF, that is “carriage return and line feed” as with the old type-writers).

*Vertices 4
1 "ResTechnolManage"
2 "Scientometrics"
3 "SocStudSci"
4 "ResPolicy"
*Matrix
1.000 0.008 0.017 0.068
0.008 1.000 0.279 0.221
0.017 0.279 1.000 0.312
0.068 0.221 0.312 1.000

Table 2: Cosine values for the citing patterns of four journals.
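
The cosine values in Table 2 can be checked outside SPSS. The Python sketch below computes the cosine between the columns (the citing patterns) of the citation matrix in Table 1. It assumes that the journals in Table 1 appear in the same order as the labels in Table 2; the function names are chosen for this illustration only.

```python
import math

# Table 1: rows are cited journals, columns are citing journals.
# Assumption: the journal order is the same as the label order in Table 2.
labels = ["ResTechnolManage", "Scientometrics", "SocStudSci", "ResPolicy"]
citations = [
    [34, 0, 0, 4],
    [0, 520, 21, 35],
    [0, 25, 100, 7],
    [2, 74, 30, 432],
]


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


n = len(citations)
cosines = [
    [cosine(column(citations, i), column(citations, j)) for j in range(n)]
    for i in range(n)
]

for label, row in zip(labels, cosines):
    print(label.ljust(18), " ".join(f"{value:.3f}" for value in row))
```

Rounded to three decimals, the output should agree with Table 2. Computing the cosine between rows instead of columns would give the cited patterns.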

One can directly import an Excel file into Pajek using the program CreatePajek available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/howto/excel2Pajek.htm .

Mapping science and technology using journal structures

After importing this file into Pajek, the results are at first a bit disappointing because all the relations have values. (The cosine runs from 0 to 1.) Had we chosen the Pearson correlation, some of the relations would have been negative, but for various reasons scientometricians prefer to use the cosine for the visualization (Ahlgren et al., 2003). (For further statistical analysis, the Pearson correlation may be the better choice.)

Go to the main menu in Pajek and choose Net > Transform > Remove > Lines with values lower than 0.2. Redraw.
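
To see what this threshold does to the four-journal example, the following Python sketch lists the relations in Table 2 that survive a cut-off of 0.2. It only illustrates the filtering step and does not reproduce Pajek's internals; the function name is an assumption.

```python
# Cosine matrix and labels copied from Table 2.
labels = ["ResTechnolManage", "Scientometrics", "SocStudSci", "ResPolicy"]
cosines = [
    [1.000, 0.008, 0.017, 0.068],
    [0.008, 1.000, 0.279, 0.221],
    [0.017, 0.279, 1.000, 0.312],
    [0.068, 0.221, 0.312, 1.000],
]


def edges_above(matrix, names, threshold=0.2):
    """Return the links of a symmetric matrix whose value is not below the threshold."""
    edges = []
    for i in range(len(names)):
        # Only the upper triangle is scanned: the matrix is symmetric and
        # the diagonal (each journal with itself) is not a relation.
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                edges.append((names[i], names[j], matrix[i][j]))
    return edges


edges = edges_above(cosines, labels)
for source, target, value in edges:
    print(source, target, value)
```

Three relations remain, all among Scientometrics, SocStudSci, and ResPolicy, while ResTechnolManage is left without any links.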

At http://www.leydesdorff.net/jcr05 , the Pajek files are provided for the citation environments of all the journals included in the Science Citation Index and the Social Science Citation Index with this threshold of 0.2. Scroll down to Research Policy, save the file as a text file, and import it into Pajek. The file contains the citation environment of all the journals which cite Research Policy to the extent of more than one percent of its total citations. Remember from above that the total citations of Research Policy were 2,470. One percent is 24.7, and thus the 23 journals citing more than 24 times are included. Cosine values below 0.2 are removed in order to enable the user to generate from these files easily a meaningful picture.

After making partitions using the core-option and removing partition zero (using the techniques explained above), you should be able to obtain a picture like this:

Can you provide this with an interpretation? Note that the relation between Social Studies of Science and Research Policy is no longer a direct one, but both of these two journals share a pattern of being cited by authors with papers in Scientometrics and Research Evaluation.

The file which you downloaded from the webpage contains additional information about the relative share of the citations of the journals. After each label, an x-factor and a y-factor are specified. You can turn this additional information on by using Options > Size > of Vertices defined in input file. You may have to adjust the size. The nodes are now depicted proportional to the y-factors and the x-factors in the input file. The y-factor provides the percentage of the citations in this local environment, and the x-factor this same percentage after correction for “within-journal self-citations,” that is, the value on the main diagonal of the citation matrix. What do you see if you turn this on? Export this picture and make it part of the submission for the third mid-term by importing the picture into Word.

Let’s do the same exercise for the citing file which you can find at http://www.leydesdorff.net/jcr05/citing . How are the citing patterns different from the cited, and why?

Repeat the analysis for a journal which is central to your own research interests. Can you learn something about the structure of this field?




Hausmann, Ricardo, and César A. Hidalgo. The atlas of economic complexity: Mapping paths to prosperity. MIT Press, 2014.

Hidalgo, Cesar A., and Ricardo Hausmann. “A network view of economic development.” Developing alternatives 12.1 (2008): 5-10.

Hodgson, Geoffrey M., and Thorbjørn Knudsen. Darwin’s conjecture: The search for general principles of social and economic evolution. University of Chicago Press, 2010.