Update: While I was wrapping up this post, Duncan Garmonsway provided a simpler method on twitter. I’ve added his way at the bottom of this post.
I recently tweeted that our Ministry of Science and Higher Education released an all-new point system for journals.
In a moment of supreme irony, the Ministry proclaims the end of point-osis, the idea that scientists have to concern themselves with points for evaluations rather than science. Of course, this new listing is simply a new point system for exactly that: evaluating scientists and universities."The end of point-osis!" proclaims our Minstry of Science and Higher Ed @NAUKA_GOV_PL as it releases an all-new journal ranking point system to be used for evaluations. #irony #facepalm #academiclifehttps://t.co/dWOZiXAMdj
— Peter Sobolewski (@psobolewskiPhD) August 2, 2019
Briefly, it’s composed of two key aspects: points and disciplines. Scientists have to declare every two years what discipline they will be evaluated in (you can pick a secondary as well) and then their university will be evaluated in each discipline according to points accumulated divided by the number of scientists declared. The evaluation grade then determined funding from the Ministry, as well as other aspects such as ability to grant PhD (and higher) degrees, etc. The points will also be used in staff evaluations. So all in all, it’s kind of a big deal to know what journals fall in which discipline and how many points they are worth.
Being the good employee that I am, I downloaded the new compendium of journals—I was pleasantly surprised to find there was an Excel file! Alas, the pleasantries were over quite quickly, as I noted on twitter:
Here’s what the top of the spreadsheet looks like:Useless data analysis with #Rstats on the @NAUKA_GOV_PL spreadsheet of journals with disciplines and points I mentioned in my previous tweet. This is one gnarly Excel spreadsheet! Check out the pro use of x's to mark disciplines. At least they aren't using a color fill... 1/n # pic.twitter.com/Tm6YHo8HNL
— Peter Sobolewski (@psobolewskiPhD) August 2, 2019

What I set out to do was to to wrangle this mess into something a bit more useful, such as a listing of journals in Materials Science (my declared discipline, “inżynieria materiałowa”), as well as those journals that are MatSci and a related discipline that I know we have collaborators in such as Medicine (“nauki medyczne”) and Chemical Engineering (“inżynieria chemiczna”). You thought getting coauthors to agree on a journal was hard before? Try now with this system where your coauthors may be declared in a different discipline! While this is probably doable in Excel, I thought this was a good opportunity to play around with some R tools, namely the tidyverse packages, as well as tidyxl and unpivotr packages (there is a really nice ebook Spreadsheet Munging Strategies by Duncan Garmonsway).
Let’s load the packages:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
library(tidyxl)
## Warning: package 'tidyxl' was built under R version 3.6.2
library(unpivotr)
And then read in the spreadsheet, that I had previously downloaded and renamed, using tidyxl.
#Read in the spreadsheet, just the first sheet, journals
wykaz_pub <- xlsx_cells("wykaz_pub.xlsx", sheet="Czasopisma")
If we view wykaz_pub you will see pretty much a mess, but that’s because tidyxl is reading not just the contents, but all of the additional information, such as row, column, formatting, data types, etc. This allows you much greater ability to wrangle the sheet into something useful.
Next, let’s remove the blank lines and select just the useful stuff, fields with letters (character) and numbers (numeric). Here I use the dplyr verbs filter and select, as well as pipes (%>%) to link the operations together.
#remove blank rows and just select useful stuff
wykaz_pub %>% filter(row>=3, !is_blank) %>%
select(row, col, data_type, character, numeric) -> wykaz_1
Now we need to make sense of the disciplines. The names are all located in the 3rd row, so we will filter for that row and then keep just the row and column information, while defining a new column for the character data.
#define header, the disciplines, these were a row at the top
first_header_row <-
filter(wykaz_1, row == 3) %>%
select(row, col, discip = character)
first_header_row
## # A tibble: 44 x 3
## row col discip
## <int> <int> <chr>
## 1 3 9 archeologia
## 2 3 10 filozofia
## 3 3 11 historia
## 4 3 12 językoznawstwo
## 5 3 13 literaturoznawstwo
## 6 3 14 nauki o kulturze i religii
## 7 3 15 nauki o sztuce
## 8 3 16 architektura i urbanistyka
## 9 3 17 automatyka, elektronika i elektrotechnika
## 10 3 18 informatyka techniczna i telekomunikacja
## # … with 34 more rows
Looks like we have the disciplines extracted, so we move on to extracting the titles and the points.
#define a column header for titles
title_header_col <-
filter(wykaz_1, row >= 5, col==2) %>% select(row, col, Title = character)
#pick out the points column
Pts <- wykaz_1 %>% filter(row>=5, col==8) %>%
select(row, col, Pts=numeric)
So far so good. Now we need to combine the titles with the disciplines. This is where it gets a bit tricky and we use unpivotr.
#combine the data with the Discipline header
wykaz_1 %>% filter(row>=5, col>=9) %>%
select(row, col, categ=character) %>%
enhead(first_header_row, "N") %>%
enhead(title_header_col, "W") %>%
select(-row, -col) %>% filter(categ!="") -> discip_table
discip_table
## # A tibble: 108,340 x 3
## categ discip Title
## <chr> <chr> <chr>
## 1 x architektura i urbanistyka 2D Materials
## 2 x automatyka, elektronika i elektrotechnika 2D Materials
## 3 x inżynieria biomedyczna 2D Materials
## 4 x inżynieria lądowa i transport 2D Materials
## 5 x inżynieria materiałowa 2D Materials
## 6 x inżynieria mechaniczna 2D Materials
## 7 x nauki farmaceutyczne 2D Materials
## 8 x nauki chemiczne 2D Materials
## 9 x nauki fizyczne 2D Materials
## 10 x architektura i urbanistyka 3 Biotech
## # … with 108,330 more rows
This one is a bit tricky. What we’re doing is focusing on the region of the spreadsheet where the x’s indicate classification: rows 5+ and columns 9+. Using select we then keep the row and column info, while creating a new column called categ which includes the character data type: the x’s. Then, the enhead function of unpivotr associates those data cells, the x’s, with the headers we extracted previously. The “N” indicates that the header cells are above the data cells. Likewise, we then associate the rows (the classifications) with the journal titles, the header column we also extracted earlier. These are to the left, hence “W”. Finally, we clean up by removing the row and column indicators and any empty classifications.
In the same fashion, we combine the points data with the journal titles.
#combine the data with the Pts column header
wykaz_1 %>% filter(row>=5, col==8) %>%
select(row, col, Pts=numeric) %>%
enhead(title_header_col, "W") %>%
select(-row, -col) -> Pts_table
Now we have a good starting point for further manipulation using tidyverse tools. For example, lets filter for journals classified as material science (“inżynieria materiałowa”).
discip_table %>% filter(discip=="inżynieria materiałowa") -> MatEng
MatEng
## # A tibble: 2,403 x 3
## categ discip Title
## <chr> <chr> <chr>
## 1 x inżynieria materiał… 2D Materials
## 2 x inżynieria materiał… 3D Printing and Additive Manufacturing
## 3 x inżynieria materiał… AAPS PHARMSCITECH
## 4 x inżynieria materiał… AATCC Journal of Research
## 5 x inżynieria materiał… AATCC REVIEW
## 6 x inżynieria materiał… ACCOUNTS OF CHEMICAL RESEARCH
## 7 x inżynieria materiał… ACI MATERIALS JOURNAL
## 8 x inżynieria materiał… ACI STRUCTURAL JOURNAL
## 9 x inżynieria materiał… ACM Journal on Emerging Technologies in Computing…
## 10 x inżynieria materiał… ACM Transactions on Modeling and Computer Simulat…
## # … with 2,393 more rows
And what about the points? We just need to do a join operation with the points table.
left_join(MatEng, Pts_table, by="Title") -> MatEngPts
MatEngPts
## # A tibble: 2,406 x 4
## categ discip Title Pts
## <chr> <chr> <chr> <dbl>
## 1 x inżynieria materia… 2D Materials 140
## 2 x inżynieria materia… 3D Printing and Additive Manufacturing 100
## 3 x inżynieria materia… AAPS PHARMSCITECH 100
## 4 x inżynieria materia… AATCC Journal of Research 20
## 5 x inżynieria materia… AATCC REVIEW 20
## 6 x inżynieria materia… ACCOUNTS OF CHEMICAL RESEARCH 200
## 7 x inżynieria materia… ACI MATERIALS JOURNAL 70
## 8 x inżynieria materia… ACI STRUCTURAL JOURNAL 100
## 9 x inżynieria materia… ACM Journal on Emerging Technologies in Comp… 70
## 10 x inżynieria materia… ACM Transactions on Modeling and Computer Si… 70
## # … with 2,396 more rows
The left_join function from dplyr will return the rows of MatEng and the columns of MatEng and Pts_table, thus giving us the result we want: points data for each journal. We can confirm by seeing that 3D Printing and Additive Manufacturing is matched with 100 points, which is in agreement with the original Excel spreadsheet (see the figure above).
What else can we do? Well, we can quickly search for a journal.
MatEngPts %>% filter(str_detect(str_to_lower(Title), "langmuir"))
## # A tibble: 1 x 4
## categ discip Title Pts
## <chr> <chr> <chr> <dbl>
## 1 x inżynieria materiałowa LANGMUIR 100
Here I used str_detect and str_to_lower to wrangle the titles into all lower case for easier searching. We can use a similar strategy to filter journal names by a word/fragment, such as nano. Below, I do that and plot a histogram of the points.
#filter then plot histogram
library(cowplot)
library(scales)
MatEngPts %>%
filter(str_detect(str_to_lower(Title), "nano")) %>%
ggplot(aes(x=Pts, y=..count../sum(..count..)))+geom_histogram(binwidth = 10)+
scale_x_continuous(breaks=c(0,20,40,70,100,140,200), expand = c(0,0), limits=c(0,215))+
ylab("Percentage in bin (%)")+xlab("Points")+
scale_y_continuous(expand=c(0,0), limits=c(0,.41), labels=percent_format())+
labs(title="'Mat Sci' Journals with 'nano' in name")
Shout out to cowplot by Claus Wilke, a must-use package along with ggplot.
Finally, I wanted to make some new spreadsheets for my boss and coworkers. I wanted to include the material science journal titles and points, but also all of the other disciplines in a readable fashion.
#Make the Mat Inż table by wide with all cross discip to export
discip_table %>% group_by(Title) %>%
summarize(Disciplines = paste(discip, collapse=", ")) %>%
filter(str_detect(str_to_lower(Disciplines), "inżynieria materiałowa")) -> MatEng_wide
## `summarise()` ungrouping output (override with `.groups` argument)
#add points
left_join(MatEng_wide, Pts_table, by="Title") -> MatEng_wide_pts
#reorder
MatEng_wide_pts <- MatEng_wide_pts[c(1,3,2)]
MatEng_wide_pts
## # A tibble: 2,404 x 3
## Title Pts Disciplines
## <chr> <dbl> <chr>
## 1 2D Materials 140 architektura i urbanistyka, automatyka, ele…
## 2 3D Printing and Additive … 100 architektura i urbanistyka, automatyka, ele…
## 3 3D Printing in Medicine 40 informatyka techniczna i telekomunikacja, i…
## 4 AAPS PHARMSCITECH 100 inżynieria biomedyczna, inżynieria materiał…
## 5 AATCC Journal of Research 20 inżynieria biomedyczna, inżynieria materiał…
## 6 AATCC REVIEW 20 inżynieria biomedyczna, inżynieria lądowa i…
## 7 Academic Journal of Manuf… 40 automatyka, elektronika i elektrotechnika, …
## 8 ACCOUNTS OF CHEMICAL RESE… 200 inżynieria biomedyczna, inżynieria materiał…
## 9 ACI MATERIALS JOURNAL 70 inżynieria biomedyczna, inżynieria chemiczn…
## 10 ACI STRUCTURAL JOURNAL 100 inżynieria biomedyczna, inżynieria lądowa i…
## # … with 2,394 more rows
Now using the dplyr verbs group_by and summarize I make the table wide, pasting all of the disciplines together, separated by commas, in a new column. Then, filter for material science and again use the left_join verb to associated point values with titles. Finally, reshuffle the order of the columns, so that it goes: title, points, disciplines.
We collaborated with people how likely selected Medicine (“nauki medyczne”) and Chemical Engineering (“inżynieria chemiczna”), so I wanted to make similar sheets that present journals that are classified as both Material Science and Medicine, as well as Material Science and Chemical Engineering, and finally, those that are classified as all three.
#Make the Chem Eng table wide with all disciplines
discip_table %>% group_by(Title) %>%
summarise(Disciplines = paste(discip, collapse=", ")) %>%
filter(str_detect(str_to_lower(Disciplines), "inżynieria chemiczna")) -> ChemEng_wide
## `summarise()` ungrouping output (override with `.groups` argument)
#add points
left_join(ChemEng_wide, Pts_table, by="Title") -> ChemEng_wide_pts
#reorder
ChemEng_wide_pts <- ChemEng_wide_pts[c(1,3,2)]
#Make the NMed table table wide with all disciplines
discip_table %>% group_by(Title) %>%
summarise(Disciplines = paste(discip, collapse=", ")) %>%
filter(str_detect(str_to_lower(Disciplines), "nauki medyczne")) -> NMed_wide
## `summarise()` ungrouping output (override with `.groups` argument)
#add points
left_join(NMed_wide, Pts_table, by="Title") -> NMed_wide_pts
#reorder
NMed_wide_pts <- NMed_wide_pts[c(1,3,2)]
#Perform joins to identify journals in both tables
Mat_Chem_wide_pts <- semi_join(MatEng_wide_pts, ChemEng_wide_pts, by="Title")
Mat_Med_wide_pts <- semi_join(MatEng_wide_pts, NMed_wide_pts, by="Title")
Mat_Chem_Med <- semi_join(Mat_Chem_wide_pts, Mat_Med_wide_pts, by="Title")
Mat_Chem_Med
## # A tibble: 82 x 3
## Title Pts Disciplines
## <chr> <dbl> <chr>
## 1 ACS Biomaterials Science … 100 inżynieria biomedyczna, inżynieria chemiczn…
## 2 Acta Biomaterialia 140 inżynieria biomedyczna, inżynieria chemiczn…
## 3 Acta of Bioengineering an… 100 automatyka, elektronika i elektrotechnika, …
## 4 Analytical and Bioanalyti… 20 inżynieria biomedyczna, inżynieria chemiczn…
## 5 Applied Adhesion Science 70 inżynieria biomedyczna, inżynieria chemiczn…
## 6 APPLIED BIOCHEMISTRY AND … 70 inżynieria biomedyczna, inżynieria chemiczn…
## 7 Applied Bionics and Biome… 70 automatyka, elektronika i elektrotechnika, …
## 8 Bio-Design and Manufactur… 20 automatyka, elektronika i elektrotechnika, …
## 9 Bioactive Materials 200 inżynieria biomedyczna, inżynieria chemiczn…
## 10 Biocatalysis and Agricult… 70 inżynieria biomedyczna, inżynieria chemiczn…
## # … with 72 more rows
And there we have it, the 82 journals that the Ministry classified as being in Material Science, Chemical Engineering, and Medicine. Now we know where to publish for the next 2 years!
Let’s take a look at the histogram of points:
Mat_Chem_Med %>%
ggplot(aes(x=Pts, y=..count..))+geom_histogram(binwidth = 10)+
scale_x_continuous(breaks=c(0,20,40,70,100,140,200), expand = c(0,0), limits=c(0,215))+
ylab("Number in bin")+xlab("Points")+
scale_y_continuous(expand=c(0,0), limits=c(0,27))+
labs(title="Material Science, Chemical Engineering, and Medicine")
Well, at least a good number are worth 100+ points.
The final thing I wanted to do was to export the new tables back to Excel so my colleagues could benefit from my work. This was very easy using the writexl package.
library(writexl)
#export each table to XLSX each on own sheet
list_of_datasets <- list("Inż Mat" = MatEng_wide_pts, "Inż Chem" = ChemEng_wide_pts, "Inż Mat & Inz Chem" = Mat_Chem_wide_pts, "Inż Mat & N Med" = Mat_Med_wide_pts, "Inż Mat & Inż Chem & N Med" = Mat_Chem_Med)
write_xlsx(list_of_datasets, "wykaz_mat_chem_med.xlsx")
We make a list of the tables we want to export, with names, and then export them all to one Excel file, with each table on a separate sheet. Done!
So that’s a code-through for how I wrangled this spreadsheet to get something useful. I’m sure it’s not the only way to do it, nor the most efficient, but it worked for me and I learned quite a bit through the process. Once I had it done I decided to see how putting together a blog post with R chunks and output would work. It was actually really easy! One of the huge advantages of using R, RStudio, and the tidyverse is that there is a huge, helpful, and friendly community with tons of available resources.
If you want to play with this, you can download my entire R script.
Update
Here’s the simpler and better solution from Duncan Garmonsway:
Thank you 🙏
— Duncan Garmonsway (@nacnudus) August 3, 2019
But I would have used readxl! pic.twitter.com/vpJwsCoJW1
library(readxl)
col_names <- read_excel("wykaz_pub.xlsx", sheet="Czasopisma", skip=1, n_max=3, col_names = FALSE) %>%
mutate_all(replace_na, "") %>%
arrange(2:1) %>% map_chr(paste0, collapse= " ")
discip_table_2 <- read_excel("wykaz_pub.xlsx", sheet="Czasopisma", skip=4, col_names = col_names) %>%
gather("discip", "category", matches("^[0-9].*")) %>%
filter(!is.na(category)) %>%
select(-category)
Here the spreadsheet is first read in using readxl and the column headers and discipline names (with numeric code) are extracted. To deal with the two rows used as headings, the cells are merged together using map_chr. Next, the table is read in again, dropping the first 4 rows to get at the data only and the previously generated column names in col_names are used as column names. gather is used to convert the wide discipline columns with the x’s into a pair of columns: discip and category. Then, filter is used to drop unmarked disciplines and finally the category column is dropped. Once this is done, we have a long version of the spreadsheet, which can be further manipulated in similar fashion as I did above. For example, we can filter and make the table wide, more human readable again.
discip_table_2 %>% group_by(`Lp. `, `Tytuł 1 `, `Punkty `) %>%
summarise(Disciplines = paste(discip, collapse=", ")) %>%
filter(str_detect(str_to_lower(Disciplines), "inżynieria materiałowa"))
## `summarise()` regrouping output by 'Lp. ', 'Tytuł 1 ' (override with `.groups` argument)
## # A tibble: 2,403 x 4
## # Groups: Lp. , Tytuł 1 [2,403]
## `Lp. ` `Tytuł 1 ` `Punkty ` Disciplines
## <dbl> <chr> <dbl> <chr>
## 1 1 2D Materials 140 201 architektura i urbanistyka, …
## 2 6 3D Printing and Additive … 100 201 architektura i urbanistyka, …
## 3 19 AAPS PHARMSCITECH 100 204 inżynieria biomedyczna, 207 …
## 4 20 AATCC Journal of Research 20 204 inżynieria biomedyczna, 207 …
## 5 21 AATCC REVIEW 20 204 inżynieria biomedyczna, 206 …
## 6 69 ACCOUNTS OF CHEMICAL RESE… 200 204 inżynieria biomedyczna, 207 …
## 7 74 ACI MATERIALS JOURNAL 70 204 inżynieria biomedyczna, 205 …
## 8 75 ACI STRUCTURAL JOURNAL 100 204 inżynieria biomedyczna, 206 …
## 9 80 ACM Journal on Emerging T… 70 203 informatyka techniczna i tel…
## 10 106 ACM Transactions on Model… 70 201 architektura i urbanistyka, …
## # … with 2,393 more rows
Here I opt to keep the original row number Lp., the titles, point values and recombine the disciplines into 1 column with the names comma separated. Then filter for Material Science.
Like I said above, the R community on twitter is awesome! I have to add a shout-out to Mara Averick whose numerous tweets have opened my eyes to new packages and things to try and play with.