Wrangling an Excel Spreadsheet with R and the Tidyverse

Update: While I was wrapping up this post, Duncan Garmonsway provided a simpler method on twitter. I’ve added his way at the bottom of this post.


I recently tweeted that our Ministry of Science and Higher Education released an all-new point system for journals.

In a moment of supreme irony, the Ministry proclaims the end of point-osis, the idea that scientists have to concern themselves with points for evaluations rather than science. Of course, this new listing is simply a new point system for exactly that: evaluating scientists and universities.

Briefly, it’s composed of two key aspects: points and disciplines. Scientists have to declare every two years what discipline they will be evaluated in (you can pick a secondary as well) and then their university will be evaluated in each discipline according to points accumulated divided by the number of scientists declared. The evaluation grade then determined funding from the Ministry, as well as other aspects such as ability to grant PhD (and higher) degrees, etc. The points will also be used in staff evaluations. So all in all, it’s kind of a big deal to know what journals fall in which discipline and how many points they are worth.

Being the good employee that I am, I downloaded the new compendium of journals—I was pleasantly surprised to find there was an Excel file! Alas, the pleasantries were over quite quickly, as I noted on twitter:

Here’s what the top of the spreadsheet looks like: Gnarly Excel spreadsheet of journals, points, and disciplines. As you can see, it’s gnarly, with columns on the left providing titles (in assorted capitalization styles), ISSNs—all duplicated for some reason—and finally Punkty—the all important points. So far so good, till we look to the right and despair. The disciplines are given as column headings (stylishly vertically oriented), with x’s serving to indicate classification. Good luck easily assessing what journals are classified in your discipline of interest and how much they are worth…

What I set out to do was to to wrangle this mess into something a bit more useful, such as a listing of journals in Materials Science (my declared discipline, “inżynieria materiałowa”), as well as those journals that are MatSci and a related discipline that I know we have collaborators in such as Medicine (“nauki medyczne”) and Chemical Engineering (“inżynieria chemiczna”). You thought getting coauthors to agree on a journal was hard before? Try now with this system where your coauthors may be declared in a different discipline! While this is probably doable in Excel, I thought this was a good opportunity to play around with some R tools, namely the tidyverse packages, as well as tidyxl and unpivotr packages (there is a really nice ebook Spreadsheet Munging Strategies by Duncan Garmonsway).

Let’s load the packages:

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
library(tidyxl)
## Warning: package 'tidyxl' was built under R version 3.6.2
library(unpivotr)

And then read in the spreadsheet, that I had previously downloaded and renamed, using tidyxl.

#Read in the spreadsheet, just the first sheet, journals
wykaz_pub <- xlsx_cells("wykaz_pub.xlsx", sheet="Czasopisma")

If we view wykaz_pub you will see pretty much a mess, but that’s because tidyxl is reading not just the contents, but all of the additional information, such as row, column, formatting, data types, etc. This allows you much greater ability to wrangle the sheet into something useful.

Next, let’s remove the blank lines and select just the useful stuff, fields with letters (character) and numbers (numeric). Here I use the dplyr verbs filter and select, as well as pipes (%>%) to link the operations together.

#remove blank rows and just select useful stuff
wykaz_pub %>% filter(row>=3, !is_blank) %>% 
  select(row, col, data_type, character, numeric) -> wykaz_1

Now we need to make sense of the disciplines. The names are all located in the 3rd row, so we will filter for that row and then keep just the row and column information, while defining a new column for the character data.

#define header, the disciplines, these were a row at the top
first_header_row <-
  filter(wykaz_1, row == 3) %>%
  select(row, col, discip = character)
first_header_row
## # A tibble: 44 x 3
##      row   col discip                                   
##    <int> <int> <chr>                                    
##  1     3     9 archeologia                              
##  2     3    10 filozofia                                
##  3     3    11 historia                                 
##  4     3    12 językoznawstwo                           
##  5     3    13 literaturoznawstwo                       
##  6     3    14 nauki o kulturze i religii               
##  7     3    15 nauki o sztuce                           
##  8     3    16 architektura i urbanistyka               
##  9     3    17 automatyka, elektronika i elektrotechnika
## 10     3    18 informatyka techniczna i telekomunikacja 
## # … with 34 more rows

Looks like we have the disciplines extracted, so we move on to extracting the titles and the points.

#define a column header for titles
title_header_col <-
  filter(wykaz_1, row >= 5, col==2) %>%   select(row, col, Title = character)
#pick out the points column 
Pts <- wykaz_1 %>% filter(row>=5, col==8) %>% 
  select(row, col, Pts=numeric)

So far so good. Now we need to combine the titles with the disciplines. This is where it gets a bit tricky and we use unpivotr.

#combine the data with the Discipline  header
wykaz_1 %>% filter(row>=5, col>=9) %>% 
  select(row, col, categ=character) %>%
  enhead(first_header_row, "N") %>%
  enhead(title_header_col, "W") %>%
  select(-row, -col) %>% filter(categ!="") -> discip_table
discip_table
## # A tibble: 108,340 x 3
##    categ discip                                    Title       
##    <chr> <chr>                                     <chr>       
##  1 x     architektura i urbanistyka                2D Materials
##  2 x     automatyka, elektronika i elektrotechnika 2D Materials
##  3 x     inżynieria biomedyczna                    2D Materials
##  4 x     inżynieria lądowa i transport             2D Materials
##  5 x     inżynieria materiałowa                    2D Materials
##  6 x     inżynieria mechaniczna                    2D Materials
##  7 x     nauki farmaceutyczne                      2D Materials
##  8 x     nauki chemiczne                           2D Materials
##  9 x     nauki fizyczne                            2D Materials
## 10 x     architektura i urbanistyka                3 Biotech   
## # … with 108,330 more rows

This one is a bit tricky. What we’re doing is focusing on the region of the spreadsheet where the x’s indicate classification: rows 5+ and columns 9+. Using select we then keep the row and column info, while creating a new column called categ which includes the character data type: the x’s. Then, the enhead function of unpivotr associates those data cells, the x’s, with the headers we extracted previously. The “N” indicates that the header cells are above the data cells. Likewise, we then associate the rows (the classifications) with the journal titles, the header column we also extracted earlier. These are to the left, hence “W”. Finally, we clean up by removing the row and column indicators and any empty classifications.

In the same fashion, we combine the points data with the journal titles.

#combine the data with the Pts column header
wykaz_1 %>% filter(row>=5, col==8) %>% 
  select(row, col, Pts=numeric) %>% 
  enhead(title_header_col, "W") %>%
  select(-row, -col) -> Pts_table

Now we have a good starting point for further manipulation using tidyverse tools. For example, lets filter for journals classified as material science (“inżynieria materiałowa”).

discip_table %>% filter(discip=="inżynieria materiałowa") -> MatEng
MatEng
## # A tibble: 2,403 x 3
##    categ discip               Title                                             
##    <chr> <chr>                <chr>                                             
##  1 x     inżynieria materiał… 2D Materials                                      
##  2 x     inżynieria materiał… 3D Printing and Additive Manufacturing            
##  3 x     inżynieria materiał… AAPS PHARMSCITECH                                 
##  4 x     inżynieria materiał… AATCC Journal of Research                         
##  5 x     inżynieria materiał… AATCC REVIEW                                      
##  6 x     inżynieria materiał… ACCOUNTS OF CHEMICAL RESEARCH                     
##  7 x     inżynieria materiał… ACI MATERIALS JOURNAL                             
##  8 x     inżynieria materiał… ACI STRUCTURAL JOURNAL                            
##  9 x     inżynieria materiał… ACM Journal on Emerging Technologies in Computing…
## 10 x     inżynieria materiał… ACM Transactions on Modeling and Computer Simulat…
## # … with 2,393 more rows

And what about the points? We just need to do a join operation with the points table.

left_join(MatEng, Pts_table, by="Title") -> MatEngPts
MatEngPts
## # A tibble: 2,406 x 4
##    categ discip              Title                                           Pts
##    <chr> <chr>               <chr>                                         <dbl>
##  1 x     inżynieria materia… 2D Materials                                    140
##  2 x     inżynieria materia… 3D Printing and Additive Manufacturing          100
##  3 x     inżynieria materia… AAPS PHARMSCITECH                               100
##  4 x     inżynieria materia… AATCC Journal of Research                        20
##  5 x     inżynieria materia… AATCC REVIEW                                     20
##  6 x     inżynieria materia… ACCOUNTS OF CHEMICAL RESEARCH                   200
##  7 x     inżynieria materia… ACI MATERIALS JOURNAL                            70
##  8 x     inżynieria materia… ACI STRUCTURAL JOURNAL                          100
##  9 x     inżynieria materia… ACM Journal on Emerging Technologies in Comp…    70
## 10 x     inżynieria materia… ACM Transactions on Modeling and Computer Si…    70
## # … with 2,396 more rows

The left_join function from dplyr will return the rows of MatEng and the columns of MatEng and Pts_table, thus giving us the result we want: points data for each journal. We can confirm by seeing that 3D Printing and Additive Manufacturing is matched with 100 points, which is in agreement with the original Excel spreadsheet (see the figure above).

What else can we do? Well, we can quickly search for a journal.

MatEngPts %>% filter(str_detect(str_to_lower(Title), "langmuir"))
## # A tibble: 1 x 4
##   categ discip                 Title      Pts
##   <chr> <chr>                  <chr>    <dbl>
## 1 x     inżynieria materiałowa LANGMUIR   100

Here I used str_detect and str_to_lower to wrangle the titles into all lower case for easier searching. We can use a similar strategy to filter journal names by a word/fragment, such as nano. Below, I do that and plot a histogram of the points.

#filter then plot histogram
library(cowplot)
library(scales)
MatEngPts %>% 
  filter(str_detect(str_to_lower(Title), "nano")) %>%
  ggplot(aes(x=Pts, y=..count../sum(..count..)))+geom_histogram(binwidth = 10)+
  scale_x_continuous(breaks=c(0,20,40,70,100,140,200), expand = c(0,0), limits=c(0,215))+
  ylab("Percentage in bin (%)")+xlab("Points")+
  scale_y_continuous(expand=c(0,0), limits=c(0,.41), labels=percent_format())+
  labs(title="'Mat Sci' Journals with 'nano' in name")

Shout out to cowplot by Claus Wilke, a must-use package along with ggplot.

Finally, I wanted to make some new spreadsheets for my boss and coworkers. I wanted to include the material science journal titles and points, but also all of the other disciplines in a readable fashion.

#Make the Mat Inż table by wide with all cross discip to export
discip_table %>% group_by(Title) %>% 
  summarize(Disciplines = paste(discip, collapse=", ")) %>%
  filter(str_detect(str_to_lower(Disciplines), "inżynieria materiałowa"))    -> MatEng_wide
## `summarise()` ungrouping output (override with `.groups` argument)
#add points
left_join(MatEng_wide, Pts_table, by="Title") -> MatEng_wide_pts
#reorder
MatEng_wide_pts <- MatEng_wide_pts[c(1,3,2)]
MatEng_wide_pts
## # A tibble: 2,404 x 3
##    Title                        Pts Disciplines                                 
##    <chr>                      <dbl> <chr>                                       
##  1 2D Materials                 140 architektura i urbanistyka, automatyka, ele…
##  2 3D Printing and Additive …   100 architektura i urbanistyka, automatyka, ele…
##  3 3D Printing in Medicine       40 informatyka techniczna i telekomunikacja, i…
##  4 AAPS PHARMSCITECH            100 inżynieria biomedyczna, inżynieria materiał…
##  5 AATCC Journal of Research     20 inżynieria biomedyczna, inżynieria materiał…
##  6 AATCC REVIEW                  20 inżynieria biomedyczna, inżynieria lądowa i…
##  7 Academic Journal of Manuf…    40 automatyka, elektronika i elektrotechnika, …
##  8 ACCOUNTS OF CHEMICAL RESE…   200 inżynieria biomedyczna, inżynieria materiał…
##  9 ACI MATERIALS JOURNAL         70 inżynieria biomedyczna, inżynieria chemiczn…
## 10 ACI STRUCTURAL JOURNAL       100 inżynieria biomedyczna, inżynieria lądowa i…
## # … with 2,394 more rows

Now using the dplyr verbs group_by and summarize I make the table wide, pasting all of the disciplines together, separated by commas, in a new column. Then, filter for material science and again use the left_join verb to associated point values with titles. Finally, reshuffle the order of the columns, so that it goes: title, points, disciplines.

We collaborated with people how likely selected Medicine (“nauki medyczne”) and Chemical Engineering (“inżynieria chemiczna”), so I wanted to make similar sheets that present journals that are classified as both Material Science and Medicine, as well as Material Science and Chemical Engineering, and finally, those that are classified as all three.

#Make the Chem Eng table wide with all disciplines
discip_table %>% group_by(Title) %>% 
  summarise(Disciplines = paste(discip, collapse=", ")) %>%
  filter(str_detect(str_to_lower(Disciplines), "inżynieria chemiczna"))    -> ChemEng_wide
## `summarise()` ungrouping output (override with `.groups` argument)
#add points
left_join(ChemEng_wide, Pts_table, by="Title") -> ChemEng_wide_pts
#reorder
ChemEng_wide_pts <- ChemEng_wide_pts[c(1,3,2)]
#Make the NMed table table wide with all disciplines
discip_table %>% group_by(Title) %>% 
  summarise(Disciplines = paste(discip, collapse=", ")) %>%
  filter(str_detect(str_to_lower(Disciplines), "nauki medyczne"))    -> NMed_wide
## `summarise()` ungrouping output (override with `.groups` argument)
#add points
left_join(NMed_wide, Pts_table, by="Title") -> NMed_wide_pts
#reorder
NMed_wide_pts <- NMed_wide_pts[c(1,3,2)]
#Perform joins to identify journals in both tables
Mat_Chem_wide_pts <- semi_join(MatEng_wide_pts, ChemEng_wide_pts, by="Title")
Mat_Med_wide_pts <- semi_join(MatEng_wide_pts, NMed_wide_pts, by="Title")
Mat_Chem_Med <- semi_join(Mat_Chem_wide_pts, Mat_Med_wide_pts, by="Title")
Mat_Chem_Med
## # A tibble: 82 x 3
##    Title                        Pts Disciplines                                 
##    <chr>                      <dbl> <chr>                                       
##  1 ACS Biomaterials Science …   100 inżynieria biomedyczna, inżynieria chemiczn…
##  2 Acta Biomaterialia           140 inżynieria biomedyczna, inżynieria chemiczn…
##  3 Acta of Bioengineering an…   100 automatyka, elektronika i elektrotechnika, …
##  4 Analytical and Bioanalyti…    20 inżynieria biomedyczna, inżynieria chemiczn…
##  5 Applied Adhesion Science      70 inżynieria biomedyczna, inżynieria chemiczn…
##  6 APPLIED BIOCHEMISTRY AND …    70 inżynieria biomedyczna, inżynieria chemiczn…
##  7 Applied Bionics and Biome…    70 automatyka, elektronika i elektrotechnika, …
##  8 Bio-Design and Manufactur…    20 automatyka, elektronika i elektrotechnika, …
##  9 Bioactive Materials          200 inżynieria biomedyczna, inżynieria chemiczn…
## 10 Biocatalysis and Agricult…    70 inżynieria biomedyczna, inżynieria chemiczn…
## # … with 72 more rows

And there we have it, the 82 journals that the Ministry classified as being in Material Science, Chemical Engineering, and Medicine. Now we know where to publish for the next 2 years!

Let’s take a look at the histogram of points:

Mat_Chem_Med %>% 
  ggplot(aes(x=Pts, y=..count..))+geom_histogram(binwidth = 10)+
  scale_x_continuous(breaks=c(0,20,40,70,100,140,200), expand = c(0,0), limits=c(0,215))+
  ylab("Number in bin")+xlab("Points")+
  scale_y_continuous(expand=c(0,0), limits=c(0,27))+
  labs(title="Material Science, Chemical Engineering, and Medicine")

Well, at least a good number are worth 100+ points.

The final thing I wanted to do was to export the new tables back to Excel so my colleagues could benefit from my work. This was very easy using the writexl package.

library(writexl)
#export each table to XLSX each on own sheet
list_of_datasets <- list("Inż Mat" = MatEng_wide_pts, "Inż Chem" = ChemEng_wide_pts, "Inż Mat & Inz Chem" = Mat_Chem_wide_pts, "Inż Mat & N Med" = Mat_Med_wide_pts, "Inż Mat & Inż Chem & N Med" = Mat_Chem_Med)
write_xlsx(list_of_datasets, "wykaz_mat_chem_med.xlsx")

We make a list of the tables we want to export, with names, and then export them all to one Excel file, with each table on a separate sheet. Done!

So that’s a code-through for how I wrangled this spreadsheet to get something useful. I’m sure it’s not the only way to do it, nor the most efficient, but it worked for me and I learned quite a bit through the process. Once I had it done I decided to see how putting together a blog post with R chunks and output would work. It was actually really easy! One of the huge advantages of using R, RStudio, and the tidyverse is that there is a huge, helpful, and friendly community with tons of available resources.

If you want to play with this, you can download my entire R script.


Update

Here’s the simpler and better solution from Duncan Garmonsway:

library(readxl)
col_names <- read_excel("wykaz_pub.xlsx", sheet="Czasopisma", skip=1, n_max=3, col_names = FALSE) %>%
  mutate_all(replace_na, "") %>%
  arrange(2:1) %>%   map_chr(paste0, collapse= " ")
discip_table_2 <- read_excel("wykaz_pub.xlsx", sheet="Czasopisma", skip=4, col_names = col_names) %>%
  gather("discip", "category", matches("^[0-9].*")) %>%
  filter(!is.na(category)) %>%
  select(-category)

Here the spreadsheet is first read in using readxl and the column headers and discipline names (with numeric code) are extracted. To deal with the two rows used as headings, the cells are merged together using map_chr. Next, the table is read in again, dropping the first 4 rows to get at the data only and the previously generated column names in col_names are used as column names. gather is used to convert the wide discipline columns with the x’s into a pair of columns: discip and category. Then, filter is used to drop unmarked disciplines and finally the category column is dropped. Once this is done, we have a long version of the spreadsheet, which can be further manipulated in similar fashion as I did above. For example, we can filter and make the table wide, more human readable again.

discip_table_2 %>% group_by(`Lp. `, `Tytuł 1 `, `Punkty `) %>% 
  summarise(Disciplines = paste(discip, collapse=", ")) %>%
  filter(str_detect(str_to_lower(Disciplines), "inżynieria materiałowa"))
## `summarise()` regrouping output by 'Lp. ', 'Tytuł 1 ' (override with `.groups` argument)
## # A tibble: 2,403 x 4
## # Groups:   Lp. , Tytuł 1 [2,403]
##    `Lp. ` `Tytuł 1 `                 `Punkty ` Disciplines                      
##     <dbl> <chr>                          <dbl> <chr>                            
##  1      1 2D Materials                     140 201 architektura i urbanistyka, …
##  2      6 3D Printing and Additive …       100 201 architektura i urbanistyka, …
##  3     19 AAPS PHARMSCITECH                100 204 inżynieria biomedyczna, 207 …
##  4     20 AATCC Journal of Research         20 204 inżynieria biomedyczna, 207 …
##  5     21 AATCC REVIEW                      20 204 inżynieria biomedyczna, 206 …
##  6     69 ACCOUNTS OF CHEMICAL RESE…       200 204 inżynieria biomedyczna, 207 …
##  7     74 ACI MATERIALS JOURNAL             70 204 inżynieria biomedyczna, 205 …
##  8     75 ACI STRUCTURAL JOURNAL           100 204 inżynieria biomedyczna, 206 …
##  9     80 ACM Journal on Emerging T…        70 203 informatyka techniczna i tel…
## 10    106 ACM Transactions on Model…        70 201 architektura i urbanistyka, …
## # … with 2,393 more rows

Here I opt to keep the original row number Lp., the titles, point values and recombine the disciplines into 1 column with the names comma separated. Then filter for Material Science.

Like I said above, the R community on twitter is awesome! I have to add a shout-out to Mara Averick whose numerous tweets have opened my eyes to new packages and things to try and play with.

Avatar
Peter Sobolewski
Assistant Professor

My research interests include bioengineering and biomaterials.

Related