Case: Bibdata.dk

Bibdata.dk is an example of how library data can be exposed so that search-engine crawlers pick it up. It closely resembles the landing-page functionality of bibliotek.dk.

  • Description
    • Metadata and recommendations only
    • Schema.org as microdata
    • Open Graph as RDFa
    • Manifestation and work pages (has also experimented with a page per series/author/...)
    • Proof of concept: not yet launched to end users, virtually no inbound links
    • The vision was to also link to the visitor's local library site based on GeoIP, etc.
  • Result
    • Visited millions of times by crawlers
    • About 1,000 organic visits per month
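
To illustrate what "Schema.org as microdata" and "Open Graph as RDFa" can look like on a work page, here is a minimal sketch. It is not the actual bibdata.dk template: the function name `render_work_page`, the chosen fields (title, author, ISBN), and the example URL are assumptions made for illustration.

```python
from html import escape


def render_work_page(title: str, author: str, isbn: str, url: str) -> str:
    """Render a minimal page with Open Graph (RDFa) and Schema.org (microdata) markup.

    Hypothetical helper: the field selection is an assumption, not bibdata.dk's template.
    """
    # Escape all values so they are safe inside both text nodes and attributes.
    t, a, i, u = (escape(v, quote=True) for v in (title, author, isbn, url))
    return f"""<!DOCTYPE html>
<html prefix="og: https://ogp.me/ns#">
<head>
  <title>{t}</title>
  <meta property="og:type" content="book">
  <meta property="og:title" content="{t}">
  <meta property="og:url" content="{u}">
</head>
<body>
  <div itemscope itemtype="https://schema.org/Book">
    <h1 itemprop="name">{t}</h1>
    <p>By <span itemprop="author">{a}</span></p>
    <p>ISBN: <span itemprop="isbn">{i}</span></p>
  </div>
</body>
</html>"""
```

Calling `render_work_page("Some title", "Some author", "9780000000000", "https://example.org/work/1")` returns one HTML document in which crawlers can read the same metadata through either vocabulary.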

Statistics for bibdata.dk

Below are statistics extracted with a few command-line one-liners run against a statistics dump in JSON format.

Googlebot visits per month:

  42042 2019-06
1965883 2019-07
 101848 2019-08
 182233 2019-09
  96517 2019-10
  14604 2019-11
  16517 2019-12
 190123 2020-01
  31140 2020-02
  22626 2020-03
  11113 2020-04
  11983 2020-05
  47094 2020-06
  16394 2020-07
  71598 2020-08

zcat stat.jsonl.gz | grep Googlebot | sed -e 's/..\(.......\).*/\1/' | uniq -c
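
The one-liner above can also be sketched in Python. The record layout is an assumption inferred from the one-liners in this section: each line is taken to be a JSON array whose first element is a timestamp starting with `YYYY-MM`, whose second-to-last element is the user agent, and whose last element is the referrer. The names `monthly_counts` and `read_log` are hypothetical.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Iterator


def read_log(path: str) -> Iterator[str]:
    """Yield the lines of a gzipped JSON Lines dump such as stat.jsonl.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from fh


def monthly_counts(lines: Iterable[str], agent_substring: str) -> Counter:
    """Count records per month (YYYY-MM) whose user agent contains agent_substring."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: [timestamp, ..., user_agent, referrer]
        timestamp, user_agent = record[0], record[-2]
        if agent_substring in user_agent:
            counts[timestamp[:7]] += 1
    return counts
```

A call such as `monthly_counts(read_log("stat.jsonl.gz"), "Googlebot")` would then give the per-month tally. Unlike `grep`, this sketch matches only on the user-agent field, and unlike `uniq -c` it does not rely on the log being in chronological order.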


Visits via Google, on average about 1,000 per month:

     54 2019-06
    841 2019-07
   1504 2019-08
   1450 2019-09
   1191 2019-10
   1341 2019-11
   1089 2019-12
   1074 2020-01
   1023 2020-02
   1264 2020-03
   1136 2020-04
   1097 2020-05
    969 2020-06
    811 2020-07
   1106 2020-08
    258 2020-09

zcat stat.jsonl.gz | grep -v Googlebot | grep -i 'google[^"]*"]$' | sed -e 's/..\(.......\).*/\1/' | uniq -c

Of these, 22 had a referrer from hume.google.com, which turns out to be Google's internal tool for its Knowledge Graph (used, among other things, to display facts next to search results).

There are also visits from Jubii, Baidu, Yandex, Bing, ...
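
As a rough Python counterpart to the referrer filter above, the sketch below tallies Google referrals per month and per referring host, which is one way to spot hosts such as hume.google.com. It assumes each log line is a JSON array of the form `[timestamp, ..., user_agent, referrer]`; that layout is inferred from the one-liners, and the function name `google_referrals` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import urlparse


def google_referrals(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (visits per month, visits per referrer host) for Google referrals.

    Requests made by Googlebot itself are excluded, as in the one-liner.
    """
    by_month: Counter = Counter()
    by_host: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: [timestamp, ..., user_agent, referrer]
        timestamp, user_agent, referrer = record[0], record[-2], record[-1]
        if "Googlebot" in user_agent:
            continue
        # hostname is None when the referrer is empty or not a URL.
        host = urlparse(referrer).hostname or ""
        if "google" in host:
            by_month[timestamp[:7]] += 1
            by_host[host] += 1
    return by_month, by_host
```

This sketch is slightly stricter than the `grep`: it looks for "google" in the referrer's host name only, not anywhere in the referrer string.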


Crawlers / user agents:

  • googlebot, applebot, bingbot
  • Russian: yandex
  • semrush, mj12bot (Majestic), blexbot (WebMeUp), ahrefs, moz, barkrowler
  • Chinese: petalbot (Huawei), aspiegel (Huawei), Mb2345Browser (2345), liebaofast (?), mqqbrowser (QQ), ucbrowser (Alibaba)
  • commoncrawl.org, netarkivet
  • our own access to the data via python/curl/axios

Requests per user agent:

8465128 Python-urllib/3.6
3086980 axios/0.19.0
2812324 Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)
2418188 curl/7.58.0
2265245 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
1855710 Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
1418830 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
1289139 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
 483667 Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
 288961 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://aspiegel.com/petalbot)
 245434 Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
 232126 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
 197984 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; AspiegelBot)
 137867 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  49156 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+http://aspiegel.com/petalbot)
  32773 Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0
  32691 Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3
  32634 Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN
  32265 Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36
  14246 Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
   8670 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
   6024 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
   5936 Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
   5710 Mozilla/5.0 (compatible; heritrix/3.3.0 +http://netarkivet.dk/webcrawler/)
   5484 Barkrowler/0.9 (+https://babbar.tech/crawler)
   4667 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
   3725 Mozilla/5.0 (compatible; Adsbot/3.1)
   3365 Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
   3134 CCBot/2.0 (https://commoncrawl.org/faq/)

zcat stat.jsonl.gz | sed -e 's/.*"\(.*\)".*".*".*/\1/' | sort | uniq -c | sort -rn
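
Because the same crawler shows up under several user-agent strings (Googlebot alone appears in three variants in the table above), it can help to fold the counts into families. The following is a minimal sketch: the token list is taken from the names mentioned in this section, while `family`, `tally`, and the label "other" are hypothetical choices.

```python
from collections import Counter
from typing import Iterable, Tuple

# Lowercase substrings that identify a crawler or client family.
# Order matters: "petalbot" is checked before "aspiegel" because
# PetalBot's user agent also contains "aspiegel.com".
TOKENS = (
    "googlebot", "applebot", "bingbot", "yandex", "semrush", "mj12bot",
    "blexbot", "ahrefs", "barkrowler", "petalbot", "aspiegel",
    "heritrix", "ccbot", "python-urllib", "axios", "curl",
)


def family(user_agent: str) -> str:
    """Map a raw user-agent string to the first matching family token."""
    ua = user_agent.lower()
    for token in TOKENS:
        if token in ua:
            return token
    return "other"


def tally(rows: Iterable[Tuple[int, str]]) -> Counter:
    """Sum (count, user_agent) rows, as produced by `uniq -c`, per family."""
    totals: Counter = Counter()
    for count, user_agent in rows:
        totals[family(user_agent)] += count
    return totals
```

Feeding the rows of the table into `tally` and calling `.most_common()` on the result gives one line per family instead of one per user-agent variant.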