Images of the arXiv: Reconfiguring large scientific image datasets



Introduction
In an ongoing research project 1 on imaging and machine learning, we have been concerned with how the ascendancy of statistical visual forms - particularly from the 1990s on 2 - has come to transform and reorganise images so as to 'redraw' knowledge about empirical phenomena. Historians and science studies researchers have long established the generative rather than simply illustrative role of images and figures within scientific practice. 3 More recently, the use of images within computer science specifically has garnered attention: "[they] often contain a crucial part of the information scholarly documents convey. Authors frequently use figures to compare their work to previous work, to convey the quantitative results of their experiments, or to provide visual aids to help [...]"

We set out using an empirical and exploratory approach by suspending the separation of the performative from the reproductive scientific image. We track, instead, observable differences in how images circulate through scientific research. By treating all forms of visual material found in scientific publications - whether diagrams, photographs, or instrument data - as bare images, we developed methods for tracking their movements across a range of scientific research. In many ways this is made possible by the labelling of images within scientific publications as 'figures.' Accordingly we treated all material appearing under the 'figure' caption as such 'bare' images. As we tracked these movements, we were better able to ask: what specific forms do the 'output' images, produced via a range of techniques, including statistical ones, assume? In what ways is the input (image) data transformed so as to become specific kinds of visual figures, newly configuring and indexing empirical phenomena? Moreover, we also sought to ask: how might we come to see the circulation of such visual figures across the vast corpus of scientific work?
In this article, we detail the dataset and statistical exploratory methods deployed to trace this circulation of visual forms. We suggest that such methods allow us different entry points into large scientific image datasets and that they initiate a new set of questions about how scientific representation might be operating at more-than-human scale.
Before we outline and discuss these methods, a couple of notes concerning the conceptual and theoretical background against which we have developed them. First, we think about scientific images in the context of the well-developed discussion of 'inscriptions' in science studies. 8 Here images that are generated (along with other techniques for inscribing scientific information, including publishing), and that appear and circulate throughout all modes of scientific practice, do not simply represent data or 'facts.' Their visual nature or indexical status is not what is primarily of importance. Instead inscriptions primarily perform two functions: the mobilisation of scientific objects among scientific communities of practice; and the consolidation and authorisation of the scientificity of such objects. This gives inscriptions the paradoxical status of being both movers-and-shakers and stabilisers of scientific research and practice. Much work has been done in visual science studies to look at how specific kinds of imaging techniques and technologies work to authorise claims about scientific data but, at the same time, destabilise and challenge knowledge formation. 9 In our work, we similarly focus on what distributed circulation across platforms and networks, as a peculiarly contemporary mode of mobilisation, might tell us about the current status and recent history of a heterogeneous array of statistical visual forms. We are less concerned, then, with what these forms empirically represent than with what operations they perform in bringing empirical phenomena into the dispersed infrastructures of scientific knowing - repositories, preprints, formatting of image files, to name only a few.
However, we are not primarily concerned with how images circulate within a specified community of scientific practice, nor with how such images/inscriptions help to sustain such communities. In part, this is because in tracing the circulation of images across the sciences we enter an arena in which the vast quantities of such images moving online and across platforms do so at a scale before and beyond specific institutional and lab-based communities of scientific practice. 10 This migratory massification of images had already been heralded by the production of large commercially available GIS datasets from the 1980s onward. But the scale and availability of public image datasets and their use in ML practices such as image recognition challenges really took off in the 2000s. Importantly, then, we are no longer talking about how inscriptions are mobilised but rather how an entire assemblage - the dataset - becomes an operative apparatus. In this contemporary situation, we look to a critical, even post-, digital humanities 11 for experimental approaches to large scientific datasets. We deploy computational methods that draw on ML itself but that also allow us to develop a situated and interrogative perspective on the circulating, exchanging, and associative tendencies of large aggregations and flows of statistical image forms.

(The making of) a large scientific image dataset
A crucial prerequisite of this work has been to generate a dataset of images drawn from scientific research that manifests knowledge claims about empirical phenomena. Such knowledge claims are now entangled with the authority that statistical approaches have found in the sciences, which also deploy data methods. 12 How do these statistical approaches also make their way into images? We built a dataset of images drawn from all preprint articles deposited in the open access repository arXiv from 1991 (its inception) until the end of 2018. 13 ArXiv (pronounced "archive") maintains over 1.5 million e-print articles across a diverse range of scientific fields. Created in 1991 by physicist Paul Ginsparg, arXiv provides a platform for authors to share articles prior to or during the process of peer review. 14 Researchers in many fields, such as high energy physics, rely on arXiv as their primary source for accessing current scholarship. 15 The number of submissions to arXiv has been growing in recent years, with 140,242 new articles added to the repository in 2018. 16

ArXiv stores papers across a range of knowledge domains from the sciences, including but not limited to: physics, mathematics, statistics, computer science and subfields of biology and economics. 17 Its development as a repository also maps many key developments for images and image-related tasks that use statistical approaches, in particular ML. To name only a few: the shift within astrophysics to data management and the classification, segmentation and recognition of objects across a new scale of image datasets enabled by the launch of the Hubble Space Telescope in 1990; the release of the MNIST handwriting dataset in 1998; the creation of ImageNet in 2009; Google Brain's 2012 deep neural network architecture for recognising cat images from unlabeled frames of YouTube videos; and the 2016 AlphaGo model, trained on millions of 19x19 images of Go board states.
The images of arXiv - as elements within the research papers of these and many more ML projects and developments - document a growing statistical visuality: a technical practice concerned with the statistical observation of something in data.
ArXiv provides a relatively accessible bulk download procedure for its articles. 18 From this bulk download, we were able to access all papers (and associated images) uploaded to arXiv from its inception to the end of 2018. Articles uploaded to arXiv consist of either: solely a PDF file (7.69%); TeX/LaTeX source code only (21.95%); or TeX/LaTeX source code with figures as separate image files (70.32%). 19 ArXiv provides multiple modes of accessing the repository: to the articles themselves via a web interface with download links; to their metadata via an OAI2 interface; and to bulk source or PDF downloads via Amazon Web Services. We used the OAI2 metadata and the bulk source data together to construct our dataset with metadata indexing. Given its open access status within scientific publishing and research distribution - particularly within fields, such as computer science and statistics, that might have an interest in source material for large datasets - arXiv has unsurprisingly already been used as source data. However, prior dataset engagement has almost exclusively used its text data, deploying this for citation tracking, natural language processing, or topic tracking and prediction. 20 Across the arXiv dataset there is a mean of over seven images per paper. Yet very little research has been conducted into using arXiv as a source for querying or analysing its image data. 21

Completing the download in January 2019 provided us with over 10 million images for our arXiv image dataset. What could an image collection structured as a dataset comprising diagrams, photographs and other graphic elements drawn from the range of scientific fields present in arXiv tell us about how statistical computing practices such as ML reconfigure ways of seeing and knowing real-world phenomena?
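As a concrete illustration of the OAI2 route mentioned above, the sketch below builds an OAI-PMH request URL against arXiv's export endpoint and extracts record identifiers from a response document. The endpoint and verb come from the OAI-PMH standard; the XML here is a trimmed, hypothetical response so that the parsing can be shown without a live request.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://export.arxiv.org/oai2"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL, e.g. ListRecords with a metadataPrefix."""
    query = urllib.parse.urlencode({"verb": verb, **params})
    return f"{OAI_ENDPOINT}?{query}"

def record_identifiers(xml_text):
    """Extract <identifier> values from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{OAI_NS}identifier")]

# Trimmed, hypothetical response used in place of a live call.
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:arXiv.org:1801.00001</identifier></header></record>
    <record><header><identifier>oai:arXiv.org:1801.00002</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

url = oai_request_url("ListRecords", metadataPrefix="arXiv")
ids = record_identifiers(sample)
```

In practice one would page through the full repository by following the resumption tokens that OAI-PMH returns with each batch; the harvested metadata can then be paired with the bulk source files.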
How might we slice through a highly diverse and voluminous image dataset, engaging with its emergent forms of organising and circulating statistical image 'types' without simply classifying or segmenting the images according to an uninterrogated taxonomy of visual forms: photographs, scatterplots, feature maps, bar graphs and so forth? The rationale behind the decision to use arXiv was that it would provide us with a dataset that contained a recent historical record of the ways in which images have figured in scientific scholarship broadly, and ML research specifically. We posited that ML techniques, used in an exploratory and interrogative mode, might provide a means for experimenting with the tendencies and relations of scientific images during a recent historical period in which ML has itself widely affected the production of scientific knowledge in many fields.

Figure 1: Montage of 16 images sampled randomly from the arXiv image dataset. Images have been resized to fit within a 480x480 pixel square. Seen here are a number of plots, charts and graphs, as well as other data visualisations, processed sensor imagery and computer-synthesised images.

* Source materials uploaded to arXiv are either given a specific licence (such as Creative Commons) or use the standard arXiv distribution licence, which grants arXiv permission to distribute the article/images while the creator retains copyright. This licence remains with the images when downloaded and used by other parties. This is particularly important as the authors of a paper may not have the rights to distribute the images, such as in the case of photographs that are being processed by computer vision algorithms. In the interest of properly accrediting the creators of these images, we have created a website which provides accreditation for each image at https://github.com/reimaging/reimaging/blob/master/methods/credits.org.
Additionally, we were interested in the availability of the metadata uploaded through the arXiv submission process. This included standard information about author, article abstract and so forth but also detailed information about the software used to create images. We posited that examining this metadata through dataset querying could unfold information about the situated knowledge contexts out of which these images arise.
We queried the arXiv image dataset according to some of the categories under which preprints were submitted - for example, 'cs.CV,' arXiv's abbreviation for computer vision. The purpose of this was to explore how, over the 28 years of arXiv's growth as a repository, the images being used within an area of scientific research communication might reveal changes and continuities in the visual cultures of these scientific endeavours. As we have already noted, scientific imaging has undergone many changes over three decades. However these changes are not simply the result of new technological developments within a particular field of research. Rather, we have witnessed an entirely new approach to the organisation, deployment and operations of images in the sciences. While there has been a huge uptake of images as data within varied areas of scientific research, there has been little investigation by either STS or visual studies as to what sociotechnical relations or implications this might have. 22 In our exploration of the arXiv dataset, we were drawn right from the start to the category of computer vision as an area of increased publication submission by arXiv authors, which tallies historically both with the rise of computer vision as a research field and with its concern for, and deployment of, statistical imaging processes and methods.
It is clear that within articles communicating scientific research, such as those found within arXiv, images cannot simply be understood instrumentally or illustratively. 23 Specific images within our dataset circulate and are reused across the period of arXiv's development, perhaps suggesting that only partial 'views' of the world may circulate throughout specific communities of practice. Looking at the images within categories of publication such as computer vision within arXiv gives some indication of imaging practices within the empirical sciences and mathematics, which have steadily adopted statistical approaches over the last 28 years. It certainly does not provide a definitive contouring of a singular visual culture or mode of working with images as a generalised tendency within research that utilises statistical imaging. But it can highlight some of the tendencies, frictions, and commonalities across seemingly diverse research areas.
The images within the bulk download of arXiv preprint manuscripts lack metadata annotating their scientific function within the preprint. Although the preprint publications have been structured and organised in particular ways - according to the authors' uploads, arXiv guides and requirements, subject categories, and discipline-specific customs, which we detail later in this article - the images within them yield a heterogeneous aggregate that renders their (scientific) function opaque. They contain artifacts of experiments or observations, images captured by sensors, images found in ML datasets and images that are the outputs of ML processing, diagrams and drawings conveying the methods and experiments undertaken, and diagrams/images that are transformations of the data that convey the 'results' of research.
We hope here to convey some sense of how a 'workable' image dataset can be generated from its source data. We want to give an account, too, of the problems and frictions that present themselves when trying to access, organise and query this data. We also discuss our experiments with various ML techniques, repurposed here to map relations of proximity (similarity) and distance (difference) in the arXiv image dataset. We are interested in whether these relations render clusters of images that can be associated with different imaging styles or forms that then map on to specific knowledge domains signalled by arXiv's 'categories.' This can then be compared with, for example, increases in quantities of images per article within a particular category of arXiv publication. It should be stressed that this only produces a correlation: increased quantities of images published in a particular category such as computer vision correlating with a potential shift in kinds of images in that same category, for example more sensor-based images appearing in that category over time (this is further discussed below). But before getting to this point we paid attention to quantitatively and qualitatively relating the arXiv image dataset to the metadata accompanying the preprint articles. This allowed us to observe the formatting and generative software deployed in the production of images in the (hard) sciences. Further to this, we have been interested in the distribution of types of images across the entire set of downloaded preprints and the distribution of such types within particular arXiv knowledge/discipline categories. By types, we mean a pragmatic ternary classification we applied to the dataset using human visual analysis and ML processes. This process distinguished between 'diagrams,' 'sensor-based images,' and 'hybrid diagram-sensor' or 'mixed' images.
We are proposing that this conjunction of data points and techniques produces potential amplifications, resonances, and disjunctions across a large scientific image dataset, suggesting tendencies in their distribution and circulation across domains of scientific knowledge and practice. In downloading, organising, converting, querying, and sampling from this dataset we also highlight the particular challenges and difficulties of working with any large dataset. By outlining the steps taken to work with and gain some understanding of the data, we hope to give an account of some dimensions of data research that may often be rendered invisible, opaque or even unworthy of attention in many domains producing knowledge through their use of 'big data.' We consider all aspects of dataset methods important contributors to data work, and are interested in elucidating the various exploratory data practices and processes that emerge when working with data from the bottom up.
Without denying the significant communicative functions of scientific images, we also consider how images in aggregate, as a large collection, might do something more. The dataset presents a number of characteristics that we believe inhibit its widespread use or distribution as a dataset. First, the dataset is large (2.1 terabytes across 1.5 million article folders), requires a number of steps to acquire and organise, and is difficult to sample subsets from. Second, the formats of its images are inconsistent: they range from standard photographic (TIFF, JPEG) and web formats (PNG, GIF) to vector graphics (SVG) and PostScript (EPS, PS). Third, images in the dataset have varying dimensions, ratios, and image quality. Lastly, not all images used in arXiv preprints are available in source form, and they cannot be consistently retrieved from PDF articles. We outline some steps we have taken to mitigate and/or work within these constraints.
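One way of taking stock of the inconsistent formats is to sniff each file's leading magic bytes rather than trust its extension. The sketch below is a minimal version covering only a few of the formats named above (EPS/PS and SVG are text-based and are detected by their leading characters); a census along these lines could precede any decision about a conversion strategy.

```python
from collections import Counter

# Leading byte signatures for a handful of the formats found in the dataset.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff",      "jpeg"),
    (b"GIF87a",            "gif"),
    (b"GIF89a",            "gif"),
    (b"II*\x00",           "tiff"),
    (b"MM\x00*",           "tiff"),
    (b"%!PS",              "eps/ps"),
    (b"<?xml",             "svg/xml"),
    (b"<svg",              "svg/xml"),
]

def sniff_format(header: bytes) -> str:
    """Classify an image file by its first bytes; 'unknown' if nothing matches."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"

def format_census(headers):
    """Tally formats across a sequence of file headers."""
    return Counter(sniff_format(h) for h in headers)
```

In use, one would read the first dozen or so bytes of each file (`open(path, "rb").read(12)`) and feed them to `sniff_format`; the resulting tally makes the spread of raster, vector and PostScript material in a corpus visible at a glance.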

Structuring arXiv images as a dataset
The resources, time, and effort required to generate and structure new large datasets from the bottom up are frequently a barrier to the diversity of datasets in, especially, ML research. Hence certain standardised image datasets tend to become benchmarks by default. 24 Instead of using a standard dataset, we embarked on creating this dataset with the idea that by engaging with actual data practices and ML techniques for images, we would, step by step, learn more of the computational operations with which data science engages images and vision. 25 IBM's harvesting of Flickr photographs to produce its facial image dataset 'Diversity in Faces' is a case in point. 27

The PDF format also tells a story about the circulation and distribution of image data (or for that matter, text, fillable forms, pop-up comments, video and audio material, semantic tags and so on). The key characteristic of the PDF format in the context of this research is its standardisation, which makes it cross-platform, reliable and queryable. 28 What we would suggest, then, is that the shift in formatting to both web-based and standardised formats for images in arXiv from 2010 onwards indexes and contributes to a broader shift in the status of images as a potential data source. That is to say, while image content and its communicative function retain a degree of importance for scientific publication, they by no means exhaust the productive capacity of arXiv's images. Instead, arXiv's image formats attest to images as vectors for mass collection, exchange, distribution, and storage. In another context, Jonathan Sterne has coined the term 'perceptual technics' to foreground how media channels and storage can be economised so as to grab corporate share of a perceptual modality such as hearing. 29 The shift of image formatting to web-based and cross-platform formats also charts the transformation of seeing images from something that is human-based to a form of seeing that is now organised via operations of data processing.
In our approach to structuring arXiv images as a dataset, we have spent time on aspects such as data extraction and sampling in order to see how data points might resonate with sociotechnical events.
ArXiv bulk source data is separate from article metadata, requiring additional steps to link image data to article metadata such as author, category, or publication date. In order to index the images with their associated article metadata, we created a SQLite database with three tables: metadata, with rows for each article; images, with rows for each image file (see table 2); and captions, for text related directly to images. For metadata, we used a primary key of a unique number, and inserted the identifier, date created, categories, authors, title, abstract, and licence.

The vast majority of image files within arXiv contain metadata relating to the software that created or last exported that image. For each image in the dataset, we ran a query using exiftool, which reads the EXIF (Exchangeable Image File Format) metadata. After some initial testing, we found that each image generally had data written in either the "Creator," "Software," "Comment" or "Desc" field that related to the software or applications used to edit or create the image, and so this was also added. Once both tables have been created, it is possible to perform SQL queries that pair the associated metadata with a given image. This allows us to create queries and perform analyses where the image data can be linked to subject categories or date. 30
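A minimal sketch of the three-table layout described above, with hypothetical column names wherever the text does not spell them out (the captions table's columns, for instance, are our guess). The per-file exiftool step is indicated as a comment rather than executed; `exiftool -json` is the machine-readable form of that query.

```python
import sqlite3

SCHEMA = """
CREATE TABLE metadata (
    id INTEGER PRIMARY KEY,
    identifier TEXT, created TEXT, categories TEXT,
    authors TEXT, title TEXT, abstract TEXT, licence TEXT
);
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    identifier TEXT,          -- joins back to metadata.identifier
    filename TEXT, format TEXT,
    creator_software TEXT     -- from EXIF 'Creator'/'Software'/'Comment'/'Desc'
);
CREATE TABLE captions (
    id INTEGER PRIMARY KEY,
    identifier TEXT, filename TEXT, caption TEXT
);
"""

def build_db(path=":memory:"):
    """Create the three tables in a fresh SQLite database."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con

# Per image, the EXIF fields would be read with something like:
#   exiftool -json path/to/figure.png
# and the relevant field written into images.creator_software.

con = build_db()
con.execute("INSERT INTO metadata (identifier, categories) VALUES (?, ?)",
            ("1801.00001", "cs.CV"))
con.execute("INSERT INTO images (identifier, filename, creator_software) VALUES (?, ?, ?)",
            ("1801.00001", "fig1.png", "matplotlib"))
rows = con.execute(
    "SELECT m.categories, i.filename FROM images i "
    "JOIN metadata m ON m.identifier = i.identifier").fetchall()
```

The join on the arXiv identifier is what lets an image be asked about its article's category, date or licence in a single query.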

Querying and analysing the image dataset
ArXiv provides some interesting statistics concerning submissions to the repository over time and by category. 31 We supplement these with statistics that are specifically about the images held in the repository (see table 3). We have also used arXiv user submission statistics in the choices we made in the image sampling. For example, noting the peak periods of publication dates across October and November in 2018, we used articles submitted in October 2018 from which to sample a larger range of images. At a later point in this project, it may be possible to correlate or compare data points across years and categories more widely.

Prior to thinking about the distribution of images across categories in arXiv, we wanted to get a sense of the overall rates and quantities of preprints being submitted in different disciplines (categories) across the whole of arXiv's life as a repository. Figure 3 shows the number of articles in each primary category, gleaned from querying the arXiv bulk download. There are four categories with significantly higher numbers of articles: hep-ph (Physics: High Energy Physics - Phenomenology), astro-ph (Physics: Astrophysics), hep-th (Physics: High Energy Physics - Theory), and quant-ph (Physics: Quantum Physics). This is unsurprising given the origins of arXiv, which began as a physics preprint repository. ArXiv's own statistics demonstrate the high participation of these disciplines in submission to the platform. Paper submissions are primarily spread between physics, computer science, and mathematics, with the incoming rate for 2019 at 40.9% physics, 27.8% computer science, and 22.5% mathematics. 32

Figure 4 shows the relative percentage of publications submitted in a given category across the years 1991-2018. This figure shows how certain fields have remained relatively consistent, such as many of the math disciplines, whereas other fields spike or grow exponentially in recent years. Computer science disciplines appear to have the largest growth in the late 2010s, shown by cs.CV (Computer Science: Computer Vision), cs.LG (Computer Science: Machine Learning) and cs.RO (Computer Science: Robotics). Figure 4 shows only relative growth and is of course highly dependent on a number of factors, such as the overall use of arXiv and the specific discipline participation habits for sharing preprints via this platform. However, it indicates a trend of increased preprint publishing in several disciplines that deploy or target ML in and for their research.

Figure 5 shows the average number of images per article for the top-16 categories, ordered by largest number of images. There is significant growth in the number of images, particularly in recent years. The most striking subplot is the exponential increase in the number of images per article in preprints submitted to cs.CV (Computer Science: Computer Vision), as well as a significant upwards trend for the average number of images in preprints in cs.LG (Computer Science: Machine Learning). We could speculate that this is connected to the ways in which image-related and image-based work have begun to feature more prominently in tasks such as classification, recognition, and analytics, part of the suite of computational and data operations that now permeate and define fields such as computer vision. There is also an upwards trend in categories such as astro-ph.GA (Astrophysics of Galaxies), which may also be related to an adoption of similar image processing practices. 33 The most significant results shown here are large increases in image numbers in cs.CV, cs.LG and stat.ML, especially from 2013 onwards. More work would be required to trace the changes in image types and numbers of images, especially as they relate to ML within non-CS disciplines that may have adopted these practices.
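Counts of the kind behind figures 3 to 5 reduce to simple aggregates once images and article metadata are joined. The query below sketches images-per-article by primary category and year against a hypothetical, much-simplified pairing of the two tables (the column names and the sample rows are assumptions for illustration, not the project's actual schema).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE metadata (identifier TEXT, year INTEGER, primary_category TEXT);
CREATE TABLE images   (identifier TEXT, filename TEXT);
""")
# Hypothetical sample: two cs.CV articles from 2018, with 3 and 5 figures.
con.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                [("a1", 2018, "cs.CV"), ("a2", 2018, "cs.CV")])
con.executemany("INSERT INTO images VALUES (?, ?)",
                [("a1", f"f{i}.png") for i in range(3)] +
                [("a2", f"f{i}.png") for i in range(5)])

# Average number of images per article, grouped by primary category and year.
QUERY = """
SELECT m.primary_category, m.year,
       CAST(COUNT(i.filename) AS REAL) / COUNT(DISTINCT m.identifier)
           AS imgs_per_article
FROM metadata m LEFT JOIN images i ON i.identifier = m.identifier
GROUP BY m.primary_category, m.year
"""
result = con.execute(QUERY).fetchall()
```

Run over the full database, the same GROUP BY yields the per-category, per-year series that a plot like figure 5 summarises.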
Overall, we gained two major insights from our querying of the arXiv image dataset. First, the formats and metadata associated with these images provide an awareness of what other computational processes images are subjected to, where they circulate across platforms and networks and what computational image practices are deployed by different scientific communities. Second, by looking at the quantities of image data in relation to the categories or knowledge domains in which they are submitted for preprint, together with information about formatting and software used to create images, we begin to get a sense of a meshwork that enfolds image production, exchange and circulation. This meshwork underpins the practices of sciences at the level of computational and platform infrastructure. Although not 'located' anywhere in the sense in which we ordinarily understand infrastructure, nonetheless this meshwork supports the increasing prominence of images as 'data' of and for scientific research.

Tendencies of the image dataset
In order to gain an initial sense of what broader image forms might be populating the arXiv image dataset, we randomly sampled subsets of 144 images from different categories and years. These were then analysed and compared for variations in image types and forms. As with any dataset of this scale, it is difficult and time-consuming to physically look at a large number of samples. This presents an interesting challenge to an investigation of the aesthetic tendencies of such image forms. Our approach has been to look at a number of randomly sampled images from across the whole dataset and from the category subsets of cs.CV, stat.ML, and cs.AI. These latter were chosen since preprints submitted to these categories use statistical computational imaging techniques such as machine and deep learning.
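Drawing a reproducible 144-image sample per category can be as simple as a seeded `random.sample` over the image rows for that category. The function below is a sketch: the sample size of 144 comes from the montages described here, while the input shape and the fixed seed are our assumptions.

```python
import random

def montage_sample(image_rows, category, n=144, seed=0):
    """Pick n image filenames from one category, reproducibly.

    image_rows: iterable of (filename, primary_category) pairs.
    """
    pool = [fname for fname, cat in image_rows if cat == category]
    rng = random.Random(seed)  # fixed seed, so the same montage can be regenerated
    return rng.sample(pool, min(n, len(pool)))

# Hypothetical rows alternating between two categories.
rows = [(f"img{i}.png", "cs.CV" if i % 2 else "stat.ML") for i in range(400)]
picked = montage_sample(rows, "cs.CV", n=144, seed=42)
```

Seeding the sampler matters for this kind of qualitative work: it means a montage discussed in the text can be rebuilt exactly when the analysis is revisited.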
The images in figure 6 demonstrate a wide variety of image formats and styles. While charts and graphs dominate, there is much diversity within these forms. For example, there are a number of bar graphs that use different colours, spatial arrangements, borders and layouts (such as the placement of a key). Scatterplots similarly take many different forms. There are also a number of heatmaps that stand out in their use of strong gradations of intense colours.

The images sampled from the primary category of cs.CV (Computer Science: Computer Vision, figure 7) display a different distribution of image forms. 35 Here we see many more photographic images, images of drawings or artworks, medical imaging, maps, satellite imagery, and segmentation maps. A number of the images appear to have been processed for edge detection or quality reduction, or have overlays added. There are a number of diagrams (graphs, charts, plots, and heatmaps), although far fewer than in the random sample seen in figure 6. Many of the images here appear to show images found in common ML datasets such as MNIST/Fashion-MNIST. 36 Images from MNIST show handwritten digits, while Fashion-MNIST shows garments, both as low resolution grayscale images. In this sample there is a handwritten number "2", and images that appear to show sweaters, shirts, and shoes. These images are likely to have had some additional computer vision processing applied. 37 Through manually labelling a training dataset and then using the trained classifier, we estimate that approximately 50% of the images in this subset from cs.CV could be conceived of as diagrams, compared with 91.9% across the entire dataset (we detail this further below).
Overall, the images appearing in this sample seem to indicate visual practices specific to computer vision knowledge production, where source images are data inputs; and to its modes of knowledge communication, where image overlays are demonstrative of something being evidenced.
The sample in figure 8 from stat.ML shows diagrams comprising approximately 80% of image types. The whole sample set features similar forms to cs.CV, including graphs and charts, as well as photographic images with overlays. A number of images appear as multiples, such as the images of traffic intersections, which we have observed occurring across subject categories such as cs.CV and stat.ML. Compared to the cs.CV sample, there is a much higher proportion of diagrams, and these appear to have more variation in their forms - bar graph, plot, flow chart, heatmap, tree diagram, scatterplot, 3D plot and so forth. The montage of images from cs.AI in figure 9 shows an even higher proportion of diagram images. From labelling each image in the montage manually and calculating totals, diagrams make up ~90% of the total images. There are only a few images that may have involved a (photographic) sensor. The diagrams are scatterplots, geometric schematics, bar graphs, heatmaps, 3D plots, flow charts and logic diagrams. In cs.AI, diagrammatic images could indicate the gathering of data into visually demonstrative forms, potentially suggesting a propensity to visually explain or communicate AI architectures.
The majority of images across the dataset as a whole could be considered as some sort of diagram; that is, a set of lines that draw parts, wholes, elements, and relations as processes, systems, apparatuses or other assemblages. Within the diagram subset, there are a number of forms that stand out, such as graphs, charts, scatterplots and heatmaps. Many of these forms gather and compress experimental measurements or observations into a graphic form whose contours, spatial relations, labelling, and colour indicate trends, correlations, groupings, or patterns. Drawing on other broadly pragmatic uses of the term from scientific literature, we bring these under the umbrella of 'diagram': "Diagrams are graphic representations used to explain the relationships and connections between the parts they illustrate." 38 There is a smaller percentage of images within the dataset that appear to have been captured by a sensor such as a camera. These include photographs or other kinds of sensor-produced images, some carrying overlays such as bounding boxes or motion arrows. A number of images within the dataset could be considered as mathematical formulae, primarily composed of symbols and equations. We assume that these appear where the authors did not use TeX/LaTeX for their typesetting but instead created the images and inserted them into their documents separately. Finally, there are images such as 3D renderings, or images that are composite overlays of part-sensor and part-diagram, that do not fall into either the diagram or sensor-based type. We call such images 'mixed.' For most subject categories in arXiv, there is a large proportion of diagrams and a much smaller proportion of images that may have been produced by sensors, along with mathematical figures, as we indicate in figures 6 to 9.
To estimate the ratios of these different types of images, we trained a neural network classifier that would predict whether an image most closely resembled a 'diagram,' 'sensor' or 'mixed.' 39 To build a training dataset, we manually labelled 9748 randomly sampled images. The decision to create a ternary classifier with a 'mixed' category was intended to capture and highlight the varied operations being performed by images across different kinds of knowledge domains. For example, in our initial querying of image distribution in the dataset, we noticed clusters of images that appeared to be somehow sensor-generated yet reorganised through graphic overlays that seemed to suggest they were being set within a diagrammatic schema. For our purposes, then, a ternary labelling and classification of the images allowed us to view the shifting ratios of image types across arXiv. Our labelling produced: 8649 (88.7%) images as 'diagram,' 477 (4.89%) as 'sensor' and 622 (6.38%) as 'mixed.' We then trained a VGG16 classifier using the labelled data. 40 This model was then used on new data inputs in order to give a rough estimation of the changing distribution of sensor-diagram images. Our ternary classifier achieves an accuracy of 92.07% on the cross-validation dataset. On unseen samples, the classifier predicts 91.9% of images to be diagrams, 2.1% of images to be sensor, and 6.0% to be mixed. This indicates that the classifier is getting similar results to our HIT labels, but the reduction in sensor images, a label with limited training data, may also point to overfitting, with an increased tendency to predict either diagram or mixed. This process of classification was used to inform our choices of arXiv categories to look at more deeply and to query the features of images across time in specific fields of scientific research.
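The classifier setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: it builds a VGG16 base with a three-way softmax head for the 'diagram' / 'sensor' / 'mixed' labels. Here `weights=None` keeps the sketch self-contained; reproducing the transfer-learning setup would use `weights="imagenet"` and fine-tuning on the hand-labelled images.

```python
# Sketch of a ternary VGG16 classifier for 'diagram' / 'sensor' / 'mixed'
# images. Hypothetical configuration: layer sizes and optimiser are
# illustrative assumptions, not taken from the study.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Convolutional base; the study used ImageNet pre-trained weights
# (weights="imagenet"), omitted here so the sketch runs offline.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # with pre-trained weights, freeze the base

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(3, activation="softmax"),  # diagram / sensor / mixed
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# One dummy image, just to show the shape of the ternary prediction.
probs = model.predict(np.zeros((1, 224, 224, 3)), verbose=0)
print(probs.shape)  # (1, 3): one probability per class
```

Training would then call `model.fit` on the labelled images, with the class imbalance noted above (88.7% diagram) suggesting class weighting or stratified sampling.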
Since this is all highly dependent on the labelling process, random image sampling, and training of the network, these results (shown in figure 10) are only a preliminary guide for further research and inquiry. Interestingly for our concerns, the category of cs.CV shows an increasing trend in the last decade away from diagram-type images towards an almost even three-way split between diagram, sensor, and mixed, as seen in Figure 10. As such, cs.CV is moving towards a much higher ratio of sensor-based images than comparative categories. The stat.ML category exhibits a similar trend, although not to the same degree. Categories such as astro-ph and related astrophysics fields are predicted to contain a majority of diagram images, remaining consistent across the time period sampled. Categories such as nlin.CG (Physics: Cellular Automata and Lattice Gases) and cs.GR (Computer Science: Graphics) are predicted as containing a high proportion of 'mixed' images. The aforementioned cs.CV category is predicted as having the highest proportion of 'sensor' images, unsurprising given the content and source material that computer vision often works with.
Due to the enormous number of images, we chose to limit our initial inquiry to only a few categories. Here, we specifically focus on looking at images from cs.CV (Computer Science: Computer Vision and Pattern Recognition), cs.AI (Computer Science: Artificial Intelligence), and stat.ML (Statistics: Machine Learning). 41 We expected that cs.CV and stat.ML would have increased numbers of images in the time following significant ML, deep learning, and image processing developments such as ILSVRC 2012, when a deep neural network was used to significantly improve upon previous results. 42 The initial metadata statistics and image form analysis confirms this and shows that there is an increase in images being submitted in preprints in these categories. Additionally, there is a shift away from the use of diagrams in the preprints in these categories and towards other image types (both sensor-based and 'mixed' images) becoming more prevalent.
The montages and classifier results give us a sense of the quantities, types and proportions of images we might find in specific categories of arXiv images, but they do not show us anything further about the relations of these images to each other or their circulation throughout the corpus of research that comprises arXiv. We wanted to get a sense of how images, whether diagrams, sensor-based or hybrids of the two, move in and out of proximity and/or distance to each other. We aimed also to explore preoccupations with a particular type of image within a category or domain of knowledge production. The ML methods we have used to think about the images of arXiv query and provide a 2D mapping of the distribution of such images according to categories (or domains of knowledge production in the sciences) via relations of image-centric proximity to and distance from each other. A classifier is used in data science to group data points into separate classes, in which each class denotes a consistent similarity of some particular thing. We have used a form of supervised learning in which we initially trained our classifier according to a set of 3 classes that were manually labelled: that is, the classes of images belonging to 'diagram', 'sensor-based' or 'mixed'. Once trained, the classifier was run over the publication submission history of particular arXiv categories. Another ML method for looking at the spatial distribution of similarity and difference via clustering of data is the t-SNE algorithm. t-SNE does this by performing dimensionality reduction and placing all images in a two-dimensional spatial mapping, which can then be analysed visually by human eyes. We ran a t-SNE algorithm across various subsets of images in order to look for patterns of similarity, without any explicit reference to disciplinary boundaries. 43 We took a VGG16 CNN model, pre-trained on ImageNet, and used the second-last fully connected layer to obtain image features.
44 This provides a 4096-dimension feature vector for each image, much smaller than the original pixel data but embedded with specific features of the classifier network. Principal component analysis, a standard process in ML dimensionality reduction, is then applied to this feature vector to reduce the size of the vector and remove redundancy. This allows us to convert the 4096-dimension vector to 300 dimensions, while retaining almost all of the variance. t-SNE is then used to find two-dimensional coordinates for each of these image vectors, which can then be used to plot the original images in two-dimensional space. It does this by iteratively calculating the nearest neighbours for each data point (each image) and reorganising the two-dimensional output until the data is placed optimally. 45 The results of this process create vectorised observations drawn from the features of the images such as textures, colours, and contours. We can glean the distribution of image forms from the proximity and clustering of images across the particular categories/years/formats being queried. In the t-SNE mapping of cs.CV images in figure 11, there are small separate peripheral image clusters that seem to group according to colour content or shapes within the image. Images with white backgrounds such as graphs and charts appear mainly in the upper half of the distribution, separated from the predominantly greyscale images that fall mainly across the lower half. A cluster of the same photograph of a man looking through a camera (a public-domain test image often used in MATLAB, titled "cameraman" 46 ) appears at the centre of the left side as a distinct group (circled in figure 11). Two main vectors dominate the distribution of images here: the separation of some images into distinct groupings (or clusters) in which repeated image content or repeated dominant features are found; and the overall organisation of the larger group of images according to colour content.
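The reduction pipeline described above (4096-dimension VGG16 features, PCA to 300 dimensions, t-SNE to two) can be sketched with scikit-learn. This is an illustrative reconstruction under stated assumptions, not the authors' code: the feature vectors here are simulated with random data, standing in for the second-last fully connected layer activations of a pre-trained VGG16.

```python
# Sketch of the dimensionality-reduction pipeline:
# VGG16 fc2 features (4096-d) -> PCA (300-d) -> t-SNE (2-d).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for 400 images' 4096-dimension feature vectors.
features = rng.normal(size=(400, 4096))

# PCA removes redundancy while retaining almost all of the variance.
reduced = PCA(n_components=300, random_state=0).fit_transform(features)

# t-SNE iteratively arranges nearest neighbours into a 2-D layout
# that can then be plotted and inspected visually.
coords = TSNE(n_components=2, perplexity=30.0, init="pca",
              random_state=0).fit_transform(reduced)
print(coords.shape)  # one (x, y) coordinate per image: (400, 2)
```

The resulting coordinates can then be used to place thumbnail images on a canvas, producing mappings like those in figures 11 to 13; note that t-SNE layouts vary with perplexity and random seed, so cluster positions are not stable across runs.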
Here the central cluster moves top-down from predominantly white-background images, to some distinct colours, to images that are largely grey, and then finally to images that are mostly muted colours. This also coincides with our quantitative findings on ratios of image types within stat.ML, shown in figure 10. In 2012, this category had a much higher proportion of diagram-type images, appearing in the t-SNE visualisation as the large spread of white-background images on the entire left-hand side. There is a much smaller proportion of photographic images, towards the centre-right, including the "cameraman" image at the far right edge.
The clustering found in the t-SNE images shows that features acquired through the VGG16 pre-trained classifier are working to seriate images along specific vectors. Here the larger image types that we have classified as diagrams, sensor-based, and hybridised diagram-sensors (or 'mixed') drive a macro-observation of such vectors. But zooming in on, for example, the stat.ML t-SNE and focusing on the diagram images (see figure 12), we can see other kinds of distribution relations emerging between the diagram forms themselves. In the figure 13 detail, features such as the slope of lines, repetition and comparative repetition, small multiples, and colour have a generative effect on clustering. For example, the grouping of images in the top-right of this detail mostly involves small multiples of comparative graphs, while the images towards the centre-left mostly show logarithmic curves. Interestingly, the further one zooms in to the t-SNE, the more the distinctiveness of clustering as definitive diagram form falls off. Instead, all the clusters tend toward all other clusters in their proximity. A patterning of diagram relations at the level of variable similarities emerges. From this, we propose that the ML and deep learning techniques we use function to observe a transversal plane in the arXiv image dataset. Moving across this transversal plane affords a different way of seeing image relations, tendencies and potentially their associations with scientific communities of practice. t-SNE and VGG processes traverse the plane without predicating any representational or indexical function. Yet they do more than act as simple quantitative segmentations of the image data. They offer us tendencies that images as forms might be following in relation to a large corpus of scientific research, and how such forms might be moving in relation to each other and the epistemological endeavours in which they are engaged.

Conclusion
Exploratory approaches to large image datasets can be of interest to data scientists and to transdisciplinary researchers in the post-digital humanities, software studies, and critical AI studies. They are especially resonant in contextualising the transcontextual flows of changing visual techniques as they percolate across scientific fields and subfields. It is somewhat ironic that in aggregating a large image dataset that incorporates images from the very sciences most focused on their labelling and recognition, we discover the lively 'resistance' of such images to classification. This 'resistance' lies not simply at the level of size and format, typically understood as data cleaning problems, but at the interlacing of image types found in preprint publishable scientific research. One lesson of the arXiv image dataset, a dataset rooted in epistemically privileged contemporary sciences such as physics, computer science, mathematics, and statistics, is that categories do not necessarily map onto image form; indeed they might mask the meshwork of images. The endemic diagram-type images in arXiv confound current database ontologies that are optimised for specific ML and deep learning tasks such as face and object recognition, making 'scientific images' in their heterogeneity themselves largely unclassifiable with current tools. Our explorations have led us some way down a path toward a rough ternary mode of classification of the images of arXiv. But we have also discovered that there seems to be no standard statistical method for segmenting their image 'forms,' because the forms present continuous variation across classificatory boundaries. If we have asked why no attention has been paid to such images, then our explorations demonstrate that more attention is due.
In making image formatting a point of interest in our exploration of the arXiv image dataset metadata, it has been possible to speculate about the relation of formatting to other recent trends in the relations of images to little-remarked visual practices of ML and computation more generally: web-scraping for data collection and the increased cross-platform circulation of images particularly stand out. Ongoing research in this project will continue to link aspects of the metadata, such as the 'creator' information, to keyword searches performed on the image captions. We hope this might render some relations between the mode of generating images (programmatically or manually, for instance) and images that are direct outputs of ML-oriented research. This, we suspect, may begin to circumscribe what ML as a research practice 'looks like.' An empirically exploratory approach to large (scientific) image datasets is not context-free, but addresses the challenge of crafting ways to engage with the circulatory visual cultures of the sciences, especially those that are ML-focussed or associated. Our research and approach is interested as much in the dataset relations ML techniques reconfigure as in the 'imagistic' nature of contemporary science image data. Our 'experiments' in assembling a large image dataset and our visual/ML observations of it have been recursive, often involving a return to running a new process on earlier quantitative findings. This has allowed us to remain 'open' to the dataset itself, while acknowledging that each return is also a computational incursion into and reconfiguration of the image data itself. This is not so unlike any work on large datasets and the building of ML or deep learning models. Data work today involves: running many iterations or 'epochs'; constant tweaks and optimisations of data and parameters; and the inevitable discovery that predictions expose over- and underfitting.
We suggest that an experimental embrace of this recursivity might also facilitate knowledge of another kind of visuality for contemporary scientific images, one that takes into account their immanent circulatory relations and the technical infrastructures and operations through which they circulate.