Bush Library Mechanic aka JCU Library Technologies: February 2014

Sunday, February 9, 2014

VALA Day 1 Wrap Up

The first plenary was delivered by Christine Borgman (UCLA) talking us through the issues around research data management.

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-plenary-1-borgman

She laid out some cautionary thoughts - librarians thinking they can handle 'data' the way we handle other information bundles is naive in the extreme. She was complimentary of Australia's 'broad' approach to research data management, but a black box approach to data obscures the complexity of what we are dealing with.


A preview of Borgman's upcoming book 'Big Data, Little Data, No Data'

No one can actually define data, and even when you examine specific cases, multitudes of complexity occur:
Different fields have different technologies, behaviours, and structures for dealing with, presenting and interpreting data. Different geolocations within a field will have significant variances.

Some of the issues to be dealt with revolve around the personal. Researchers often feel the data they have is their 'dowry' that makes them valuable to their institutions, and that sharing diminishes its value. Others don't mind sharing but want personal contact with the person they are sharing with, for a number of reasons, and feel uncomfortable with unmediated access. On the receiving end, researchers don't necessarily like using someone else's data because of issues of trust.

When you start examining the data, or as I coined it 'metareality', even more issues of complexity arise. One neat example was the fact that there are several taxonomies for describing drosophila genomes - how open is the data if it's coded in a way the user doesn't use?

And what about access? At what level of granularity do you provide access? A single link to a zip file of everything? URIs for individual files, or as Borgman mentioned 'A DOI for every cell in a spreadsheet'? What about format? What role do we have in ensuring the format is accessible over time?

What are the effects of knowledge infrastructures (institutional, national, discipline) on authority, influence and power? Recommended reading: http://knowledgeinfrastructures.org/

A strong case was made for skill sets not normally associated with librarians - economics, records management (determining at time of storage how long you're going to store it) and archival practice (provenance is a huge factor in building trust). It's long been said that some areas of JCU assume the Library will take on maintenance of research data management - we really need to be aware of just what is involved and how much we will need to be embedded in research areas as partners and the resources and skills that will demand. In an aside Christine mentioned a 200 page book telling you how simple data citation is!

Hacking the library catalogue: a voyage of discovery

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-session-2-kreunen

A kind of interesting talk from UniMelb's University Digitisation Centre (UDC) about the work its 5.4 staff do, particularly in trying to manage metadata for scans from the print collections, and how they managed to scrape data from the catalogue as automatically as they could. I could relate to the speedbumps they kept hitting with things that should work but don't. For example, they figured out how to get their catalogue to present a record in XML, but the scanning software wouldn't accept the XML as valid, while the vendor insisted it was, so it was workaround on workaround.

Marking up NSW: Wikipedia, newspapers and the State Library

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-session-2-cootes

A cool project involving the NSW State Library and some interested public libraries that identified 20 volunteers to enrich Wikipedia entries for local newspapers that had been digitised and added to Trove. Lots of liaison with Wikipedia was needed because that community is very sensitive to lots of changes in a narrow section. So the word was sent out that the volunteers would be doing work and that other contributors should be 'nice' while they found their feet. One interesting observation was that it was 'unlibrarianlike' to have to accept that once you've added content in Wikipedia it is no longer 'yours' and you have no veto over how it's changed, any more than any of the thousands of people who actively contribute to Wikipedia.

I think we should consider this as another step in processing records for NQHeritage. I think as a profession we should be augmenting and improving Wikipedia with reputable resources. As was pointed out: Wikipedia is the 5th most popular site on the web - users are going to use it so let's make it better - and expose our resources through this incredibly popular portal.

Journey into the user experience: creating a library website that's not for librarians

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-session-4-murdoch

I did a standing ovation at the end of this one. Loved that they went all Steve Krug on usability testing. Loved the heat map (I will have to find a way of creating one for our site through the Google Analytics data), which effectively showed that 3 links were pretty much covering 90+% of usage.

Jealous that their university web/marketing section trusted them enough to build templates outside the corporate templates (in fact the library laid the path for the rest of the university).

Check out what they ended up with!
http://www.library.aut.ac.nz/

Influences of technology on collaboration between academics and librarians

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-session-5-pham

Report about an in-depth case study of academic/librarian relations revolving around information literacy training and the learning management system at Monash. Pleased to say we have moved beyond some of the problems identified (like lecturers not seeing why a librarian doing IL for their students would need access to the subject in Blackboard) but I think our advances are patchy.

Just accept it! Increasing researcher input into the business of research outputs

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-session-5-ogle

A report on the University of Newcastle's improvements in automating HERDC reporting, and again it was nice to see we were more advanced in this area - their next big project was to integrate their research database with their institutional repository, something our Library, Research Services and ITR sorted a couple of years ago. I wonder what they would have thought of David Beitey's work on the Research Portfolios.

The day was rounded off by a roller coaster presentation 'Social media as an agent of socio-economic change: analytics and applications' by Johan Bollen from Indiana U

Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-plenary-2-bollen

Johan is probably best known for work showing a correlation between Twitter 'mood' and stock market movements, which came out of a project that assigned 'emotion' to tweets through 'big data' analysis applying Affective Norms for English Words (ANEW), which:

"...provides a set of normative emotional ratings for a large number of words in the English language. This set of verbal materials have been rated in terms of pleasure, arousal, and dominance in order to create a standard for use in studies of emotion and attention."

Turns out that 3 days after a spike in negative feeling in the twittersphere the Dow Jones will drop, and a positive spike will correlate with an increase after the same lag.

Bollen was an engaging champion for the wisdom of crowds, pulling anecdotes and findings from all over the place to make his convincing case.

1 in 3 people in the world has internet access - North America tops the list with nearly 80% penetration and Africa is at the bottom with below 20%, with massive recent growth in Asia and the Middle East.

If Twitter 'nation' was a nation only China and India would have more people, and analysing linkages and traffic can tell you much about the macro movements resulting from the micro actions of individuals.

Flickr (orange) and Twitter (blue) map created by Eric Fischer

While some think that the internet, by giving everyone a voice, will turn society into an 'idiocracy', Johan does not share this pessimism and gave a couple of examples where crowds do actually make better decisions than experts. He showed us a mathematical formula (Condorcet's jury theorem) which shows that a jury made up of people who are right even slightly more than 50% of the time gets more and more accurate the higher the number of jurors. He then pondered whether the current American political situation was because too many people are right just under 50% of the time, which by the same formula magnifies the wrongness of the crowd.
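For anyone who wants to poke at the jury theorem, here is a minimal sketch. It assumes the textbook form of the theorem (independent jurors, each right with the same probability p, decision by simple majority), in which the chance the majority is right is the sum over k > n/2 of C(n,k) * p^k * (1-p)^(n-k). The function name and the sample probabilities are mine, not from Johan's slides.

```python
from math import comb


def majority_correct_probability(n_jurors, p_correct):
    """Chance that a strict majority of independent jurors is right.

    Assumes every juror is right with the same probability p_correct
    and that n_jurors is odd, so there are no tied votes.
    """
    if n_jurors % 2 == 0:
        raise ValueError("use an odd number of jurors to avoid ties")
    smallest_majority = n_jurors // 2 + 1
    return sum(
        comb(n_jurors, k) * p_correct**k * (1 - p_correct) ** (n_jurors - k)
        for k in range(smallest_majority, n_jurors + 1)
    )


if __name__ == "__main__":
    # Compare jurors who are slightly better than a coin toss (0.55)
    # with jurors who are slightly worse (0.45) as the jury grows.
    for n in (1, 11, 101):
        better = majority_correct_probability(n, 0.55)
        worse = majority_correct_probability(n, 0.45)
        print(n, round(better, 3), round(worse, 3))
```

Running it shows both halves of the point: the first column of probabilities climbs as the jury grows, and the second one sinks.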

Johan had 80 slides both whimsical and insightful, hope the video is up soon.

The finale was his mojito-fueled solution for research funding that eliminated the tortuous and wasteful process of writing and reviewing proposals. Basically, allocate an equal share of the total amount available to all the researchers but require each researcher to give half to a researcher whose work they think is valuable. A bunch of checks would need to be in place, but you would be crowdsourcing research funding allocation to the researchers themselves. The modelling shows that any 'waste' would be less than the amount of resources spent in maintaining the current proposal/review system. Personally I worried that a standout researcher would receive way more money than required, but I wasn't Mojito assisted. Johan's explanation was much more cogent than mine.
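To make the mechanics concrete, here is a toy sketch of one funding round as I understood it. Everything in it is an assumption for illustration: the function name, the single nominated recipient per researcher, the 50% share, and the made-up researchers and budget. It is not Johan's actual model, which had more checks and moving parts.

```python
def allocate_funding(total_budget, chosen_recipient, share_given=0.5):
    """Run one round of the 'pass half of it on' funding idea.

    chosen_recipient maps each researcher to the one colleague they
    nominate (a simplification). Everyone starts with an equal share,
    keeps (1 - share_given) of it and hands the rest to their nominee.
    """
    base_share = total_budget / len(chosen_recipient)
    funds = {name: base_share * (1 - share_given) for name in chosen_recipient}
    for giver, recipient in chosen_recipient.items():
        if recipient == giver:
            raise ValueError(f"{giver} cannot nominate themselves")
        if recipient not in funds:
            raise ValueError(f"{recipient} is not in the funding pool")
        funds[recipient] += base_share * share_given
    return funds


if __name__ == "__main__":
    # Hypothetical pool of four researchers sharing a budget of 400.
    nominations = {"ana": "bo", "bo": "cy", "cy": "bo", "di": "bo"}
    for name, amount in allocate_funding(400, nominations).items():
        print(name, amount)
```

In this toy pool three of the four nominate the same person, which is exactly the 'standout researcher' pile-up I was worried about.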

Some links and thoughts to ponder:

Digital Humanities experiment with big data http://www.themacroscope.org/
Mood analysis shows that negative tweeps tend to cluster in networks with each other
In networking terms a retweet is as valuable as the original tweet
Bollen mistrusts Altmetrics because the major drivers are media outlets, not individuals

For another view of all the bits I missed or mucked up try Deborah Fitchett's or Hugh Rundle's blogs.

Tuesday, February 4, 2014

Creating your own ebook #vala14 #bcb

Chris Cormack, with throbbing leg and nauseous belly, delivered a session bursting with so many ideas that I'm still putting together all the resources he provided.

After a greeting in Māori the author of Koha delivered a measured rant that put him more in the Aaron Swartz camp than the big publisher camp. My interpretation of his thesis is that digital publishing completely changes the 'book' business model, and powerful companies are trying desperately to preserve the old system where a book was a non-transmittable, nearly non-replicable item of information.

The system has to change, but it's unlikely the publishers will drive the change, so Chris, in 'Don't die like an octopus, die like a shark' mode, wants to show people how easy it is to create an ebook, to democratize access to the tools of ebook production. Ebook production is easy - but creating content is just as hard as it's ever been - but publishers don't create content, they control the mechanisms of publishing.

And that was just the start - I'll try and list the tools and thoughts around them.
First up he got us to log into an instance of BookType he'd set up in a couple of hours on a German host - he says any IT department could do it in an afternoon. BookType is the tool of choice of FlossManuals.

So we started on our books after about 5 minutes. While we fiddled, Chris kept coming up with related tools and thoughts like:

Project Gutenberg has a self-publish option
By handholding our users through tortuous ebook platforms library staff are effectively unpaid 'groomers/promoters' for those publishers
TOR Books don't use DRM (the tool of choice for fabricating scarcity) but are still turning a profit
IE should not be used with BookType (actually the last two words are redundant)
Using Calibre to publish your book in different formats (PDF, epub et al)
Using Google Docs to create and collaborate on the 'raw' content then use BookType to format it as an ebook
Libraries offering an inhouse ebook publishing service, a digital 'makerspace'
BookType can import an ebook from the internet archive, simply.
Using http://www.online-convert.com/ to convert between ebook formats (or anything else for that matter)
You can sell your ebooks via WordPress using the free ecommerce plug-in WooCommerce
Or you could host your ebooks on http://dpubs.org/
You can promote them using Open Publications Distribution System http://en.wikipedia.org/wiki/OPDS
Why not get your own Open Book feed reader!
Worried about version control? Flashbake monitors a nominated directory, creating versions of files http://bitbucketlabs.net/flashbake/
Open Journal Systems http://pkp.sfu.ca/ojs/
If you are gonna licence your stuff CC-BY go CC-BY-SA (share alike) rather than CC-BY-NC (non-commercial) because no-one can define commercial properly in all situations - it will be a barrier to use.
While BookType can be used collaboratively it locks at chapter level - why not use EtherPad and have everyone work on the same page at the same time and see changes in real time!
Want to showcase your other works, well of course there is free eportfolio software too: Mahara
Speaking of alternative cost recovery mechanisms in publishing, ever seen Humble Bundle? Its tag line is 'Pay what you want. Support charity. Get eleven extraordinary audiobooks.' Just for giggles, check out the weekly sales data to see which OS's users pay the most on average (scroll down on the Games page)

So many toys to play with! Do we have a role in epublishing? I would like to see us collaborate within and without JCU on open etextbook production, and Chris shows that the tools aren't the barrier - it's the will and effort.

Monday, February 3, 2014

Brewing Your Own Linked Data #vala14 #bca

Well that session was exactly what I wanted! Open linked data has been on my watch list since 2010 after seeing OpenCalais demoed at that year's VALA.

It was great to see under the hood and spakfil my limited understanding of what it all meant. I liked that Tim and Paul were aware of 'Linked Open Data (LOD) Paralysis' (see image below) and kept dragging us back to practical uses.

So I now know RDFa is the de facto standard, and that Google, Microsoft Bing and Yahoo actually cooperated. Handcrafting our own linked data in HTML code gave us an object lesson in why you would only want to create this sort of markup using computers. But as the 'other' Tim and Paul point out - libraries already have this information stored in databases in a highly structured format, it's just that once we publish it to the web we throw the structure out so it looks nice on a page. Generating this same content with embedded Linked Open Data (LOD) from these databases shouldn't be hard.

LOD maintains the structure of the data (so it makes sense to computers) without changing how it looks (which makes sense to humans).
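As a rough sketch of what 'generating it from the database' could look like, here is a bit of Python that turns one structured record into HTML with RDFa attributes baked in. The record layout and the function name are made up for illustration, and I'm assuming the schema.org vocabulary with the standard RDFa attributes (vocab, typeof, property, resource). It is not what Tim and Paul showed, nor what any of our systems actually do.

```python
from html import escape


def book_to_rdfa(record):
    """Render a (hypothetical) catalogue record as HTML carrying RDFa.

    The visible text is just 'title by author'; the vocab, typeof,
    property and resource attributes carry the structure for machines.
    """
    title = escape(record["title"])
    author = escape(record["author"])
    author_uri = escape(record["author_uri"])
    return (
        '<div vocab="http://schema.org/" typeof="Book">\n'
        f'  <span property="name">{title}</span> by\n'
        f'  <span property="author" typeof="Person" resource="{author_uri}">\n'
        f'    <span property="name">{author}</span>\n'
        "  </span>\n"
        "</div>"
    )


if __name__ == "__main__":
    # Example record; the author URI points at the DBpedia resource.
    record = {
        "title": "Harry Potter and the Philosopher's Stone",
        "author": "J. K. Rowling",
        "author_uri": "http://dbpedia.org/resource/J._K._Rowling",
    }
    print(book_to_rdfa(record))
```

A browser would still just show the title and author; the extra attributes are only there for the computers.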

'But why would we care what those dumb old computers think?' I hear you ask. Because once computers understand that a particular text string is a person, building, organisation or tea cozy, with a bunch of attributes that may also be shared with other 'things', we make clearly visible the connections that are invisible or buried in unstructured information.

It's early days but as LOD becomes the default the possibilities expand. If you want to see an example of it search Google - the panel on the right hand side is pulling that information from various sites using LOD.

The other takeaway was that search engines prefer LOD-enriched sites, paying attention to content that would otherwise be just a long string of characters to be ignored (with 'normal' pages Google mostly relies on the HTML title tag). LOD-enhanced content opens up discovery.

Using the resource tag is like a web-based authority list. If you link a data element to a verified source, your item is automatically linked to every other data element anyone else has linked to that same resource, which in turn expands the number of 'attributes' that item has. Bear with me with this tortuous example. Say you link a publication's author to a verified linked data target. And say someone else links a player in a list of former Seattle Seahawks greats to that same person. Then without any human intervention a third party locating the book can see that the author once played in the Superb Owl (HT @StephenAtHome).
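Here is that tortuous example as a tiny sketch. The sites, the statements and the example.org URI are all invented for illustration; the only point is that grouping by a shared resource URI is what joins up facts published by people who have never heard of each other.

```python
from collections import defaultdict


def group_by_resource(statements):
    """Index (source, fact, resource_uri) statements by the URI they point at."""
    linked = defaultdict(list)
    for source, fact, resource_uri in statements:
        linked[resource_uri].append((source, fact))
    return dict(linked)


if __name__ == "__main__":
    # Hypothetical person URI and hypothetical publishers of the facts.
    person = "http://example.org/resource/Some_Person"
    statements = [
        ("library catalogue", "author of a book", person),
        ("football fan site", "former Seattle Seahawks player", person),
    ]
    for source, fact in group_by_resource(statements)[person]:
        print(f"{source}: {fact}")
```

Neither site had to know about the other; pointing at the same resource did all the work.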

Paul and Tim list a bucket load of resources here: http://t.co/RiIlQk4s8J

dbpedia is the semantic web mirror of Wikipedia and a common datastore the resource tag is pointed at. It might help to get a sense of just how much information is available about a 'thing' by taking a look at the entry for J.K. Rowling (courtesy of the Fantales exercise): scroll down to see the extent of informational attributes recorded for the author.

What does it mean for us? Well the biggest buckets of structured data we control directly are our catalogue and eprints - neither of the frontends we provide for them (Tropicat & ResearchOnline@JCU) currently have any facility for integrating LOD - but perhaps we can encourage developments in that area from the vendors, likewise Summon might be pushed in that direction, and Tim and Paul were clear that Trove was moving toward increasing LOD - initially driven because 76% of Trove referrals come from Google, and LOD promotes your resources in Google. Trove harvests our eprints through OAI-PMH (which is highly structured XML) so we benefit from their work. Also on my wish list is LOD capability in any new CMS JCU acquires.
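For a feel of just how structured that OAI-PMH XML is, here is a small sketch that pulls the Dublin Core fields out of a single oai_dc record. The sample record is invented and heavily trimmed (a real harvest response wraps each record in more OAI-PMH envelope), and the function name is mine; only the two namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

# Invented, cut-down oai_dc record of the kind a harvester receives.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint title</dc:title>
  <dc:creator>Researcher, Example</dc:creator>
  <dc:creator>Colleague, Another</dc:creator>
  <dc:date>2014</dc:date>
</oai_dc:dc>"""

DC_NAMESPACE = "{http://purl.org/dc/elements/1.1/}"


def dublin_core_fields(xml_text):
    """Collect Dublin Core elements into {field name: [values]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        if element.tag.startswith(DC_NAMESPACE):
            name = element.tag[len(DC_NAMESPACE):]
            fields.setdefault(name, []).append(element.text)
    return fields


if __name__ == "__main__":
    for name, values in dublin_core_fields(SAMPLE_RECORD).items():
        print(name, values)
```

Every value arrives already labelled as a title, creator or date, which is exactly the structure we throw away when we flatten it onto a web page.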