kris-sigur.blogspot.com

Kris's blog

August 17, 2015. The WARC Format 1.1. The WARC Format 1.0 is an ISO standard describing the container format that web archives use to store their data. WARCs contain not only the actual file resources (HTML, images, JavaScript, etc.) but also request and response headers, metadata about the resources and the overall collection, deduplication records, and conversion records. It's a pretty flexible format. It has served us quite well, but it is not perfect. To help with the procedural aspect we came up with...

http://kris-sigur.blogspot.com/

TRAFFIC RANK FOR KRIS-SIGUR.BLOGSPOT.COM

TODAY'S RATING

>1,000,000

BEST MONTH (AVERAGE TRAFFIC RANK PER MONTH)

November

HIGHEST TRAFFIC DAY OF THE WEEK

Saturday

CUSTOMER REVIEWS

Average Rating: 3.2 out of 5 with 9 reviews
5 star: 3
4 star: 0
3 star: 4
2 star: 0
1 star: 2

LOAD TIME

0.8 seconds

CONTENT SCORE

6.2

PAGE TITLE
Kris's blog | kris-sigur.blogspot.com Reviews
META KEYWORDS
1. kris's blog
2. posted by
3. kristinn sigurðsson
4. 1 comment
5. email this
6. blogthis
7. share to twitter
8. share to facebook
9. share to pinterest
10. labels warc
KEYWORDS ON PAGE
kris's blog, posted by, kristinn sigurðsson, 1 comment, email this, blogthis, share to twitter, share to facebook, share to pinterest, labels warc, webarchiving, customizing heritrix reports, and the crawlrss, class in heritrix, no comments, labels heritrix, labels iipc
SERVER
GSE
CONTENT-TYPE
utf-8

INTERNAL PAGES

1

Kris's blog: The WARC Format 1.1

http://www.kris-sigur.blogspot.com/2015/08/the-warc-format-11.html

Web archiving, mostly from a technical aspect. Plus other things, occasionally. August 17, 2015. The WARC Format 1.1. The WARC Format 1.0 is an ISO standard describing the container format that web archives use to store their data. WARCs contain not only the actual file resources (HTML, images, JavaScript, etc.) but also request and response headers, metadata about the resources and the overall collection, deduplication records, and conversion records. To help with the procedural aspect we came up with a...

2

Kris's blog: August 2015

http://www.kris-sigur.blogspot.com/2015_08_01_archive.html

Web archiving, mostly from a technical aspect. Plus other things, occasionally. August 24, 2015. We started doing deduplication four years before we started using WARC. As the ARC format had no revisit concept, the only record of the deduplicated items from that era lies in the crawl logs. When we put our collection online back in 2009 we built our own indexer that consumed these crawl logs so we could include these items. It worked very well at the time. [...] already does this, it was a minor task to adapt...

3

Kris's blog: December 2014

http://www.kris-sigur.blogspot.com/2014_12_01_archive.html

Web archiving, mostly from a technical aspect. Plus other things, occasionally. December 16, 2014. Deduplicating text based data. In my last two posts about deduplication, you may have noticed the following caveat: It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated. The reasons for ignoring text documents derive from analysis I did 8-9 years ago when first developing the DeDuplicator. In my last post there was a table that showed that a total...

4

Kris's blog: June 2015

http://www.kris-sigur.blogspot.com/2015_06_01_archive.html

Web archiving, mostly from a technical aspect. Plus other things, occasionally. June 30, 2015. Web archiving APIs - a start. In fact, we already have one web archive "API" in wide use: the WARC file format. While technically not an "application programming interface", it serves the same fundamental purpose, to enable interoperability. It has decoupled harvesters (e.g. Heritrix) from replay systems (e.g. OpenWayback) and both of those from analytical/data mining software, etc. We do have a few informal "A...

5

Kris's blog: Deduplicating text based data

http://www.kris-sigur.blogspot.com/2014/12/deduplicating-text-based-data.html

Web archiving, mostly from a technical aspect. Plus other things, occasionally. December 16, 2014. Deduplicating text based data. In my last two posts about deduplication, you may have noticed the following caveat: It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated. The reasons for ignoring text documents derive from analysis I did 8-9 years ago when first developing the DeDuplicator. In my last post there was a table that showed that a total...

TOTAL PAGES IN THIS WEBSITE

19

OTHER SITES

kris-shatty-boy-crew.skyrock.com

kris-shatty-boy-crew's blog - Représente madinina 972 - Skyrock.com

Created: 03/07/2011 at 9:37 PM. Updated: 05/08/2011 at 8:55 PM. Posted on Thursday, 07 July 2011 at 11:49 AM.

kris-sheridan.com

Pages under construction

404: Not Found - www.kris-sheridan.com.

kris-shoes.fr

Roux Chaussures - the blog

Roux Chaussures - the blog. Choosing the right shoes. Find us on Facebook as well and enjoy exclusive promotional offers by becoming a fan. The Swedish sheepskin slipper. Not only are they beautiful and warm, they are also eco-friendly. Plenty of compliments for this brand, of which we are fairly big fans. The materials are beautiful and soft. Your feet will stay warm without sweating. Find all our women's slippers on our website.

kris-sinnaeve.be

Foto-Kris-Torhout - Kris Sinnaeve