digitalpebble.blogspot.com
DigitalPebble's Blog: NUTCH FIGHT! 1.7 vs 2.2.1
http://digitalpebble.blogspot.com/2013/09/nutch-fight-17-vs-221.html
Monday, 16 September 2013. 1.7 vs 2.2.1. We've had releases in the Nutch 2.x branch for over a year now. As I described in a. The main difference with the 1.x branch is the use of Apache Gora as a storage abstraction layer, which allows the use of various flavours of NoSQL databases such as HBase, Cassandra or Accumulo as backends. We have measured the performance of Nutch 1.7 against 2.2.1 (HBase and Cassandra) using 3 million URLs from the CommonCrawl project. These URLs were. It is important to note that ...
digitalpebble.blogspot.com
DigitalPebble's Blog: What's new in Storm-Crawler 0.5
http://digitalpebble.blogspot.com/2015/06/whats-new-in-storm-crawler-05.html
Friday, 5 June 2015. What's new in Storm-Crawler 0.5. We've just released version 0.5 of Storm-Crawler, just over three months after the previous one. As you can read below, we've been pretty busy! The project got some great contributions from new users and is seeing an increase in adoption, which is very encouraging. One of the main improvements provided in the new release is the introduction of a Metadata object, which replaces the Map<String,String[]>. This is now the one we use by default, the one...
mevivs.wordpress.com
Hector-Kundera | Vivek Mishra's Blog
https://mevivs.wordpress.com/2011/02/12/hector-kundera
Vivek Mishra's Blog. JPA Compliant & Annotation Based:. Makes it an entity class. Assign ColumnFamily type and name. The email address. */. The country. */. The registered. */. The name. */. Instantiates a new author. Author() { / must have a default constructor. Defines column family and keyspace of given entity. Configuration conf = new Configuration();. ConfgetEntityManager( unit-name );. Persistence and Search Using EntityManager :. String key = System. Key, "a@a.org", "India", new. AObj, aObj db);.
github.com
GitHub - DigitalPebble/storm-crawler: Web crawler SDK based on Apache Storm
https://github.com/DigitalPebble/storm-crawler
Web crawler SDK based on Apache Storm. Aug 23, 2016. Flush BulkProcessor before closing connection. Jul 19, 2016. AbstractHttpProtocol : added utility class to get agent string from conf. Jul 21, 2016. Aug 23, 2016. Ignore OSX system files. Jan 29, 2016. Update .travis.yml. Jul 1, 2016. Added LICENSE and NOTICE; fixed license headers in files. Sep 5, 2014. May 25, 2016. Uped version of archetype in readme. This will...
github.com
DigitalPebble Ltd · GitHub
https://github.com/DigitalPebble
http://www.digitalpebble.com. github@digitalpebble.com. Web crawler SDK based on Apache Storm. Aug 23, 2016. WARC resources for StormCrawler. Jul 22, 2016. Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. Apr 26, 2016. Azazello is an open source platform for large scale document analysis based on Apache Spark. Apr 20, 2016. Mirror of Apache Storm. Mar 16, 2016. A set of reusable Java components that implement functionality common to any web crawler. GATE Pr...
digitalpebble.blogspot.com
DigitalPebble's Blog: What's new in Storm-Crawler 0.4
http://digitalpebble.blogspot.com/2015/01/whats-new-in-storm-crawler-04.html
Wednesday, 28 January 2015. What's new in Storm-Crawler 0.4. We've recently released the version 0.4 of. Which is a collection of resources for building low-latency, large scale web crawlers with. The project has been really active in the last few months, thanks partly to our 2 fantastic new committers (Jake Dodd and Gui Forget) and as a result contains some important changes and improvements. Reorganisation of the code. That can be used to index documents with ElasticSearch. Stream, which is meant to be...
digitalpebble.blogspot.com
DigitalPebble's Blog: DigitalPebble is hiring!
http://digitalpebble.blogspot.com/2013/06/digitalpebble-is-hiring.html
Wednesday, 5 June 2013. We are looking for a candidate with the following skills and expertise :. Experience in web crawling, ideally with Apache Nutch. Storm, Hadoop and related technologies. Interest in text processing, NLP and ML. Good social and presentation skills. Good spoken and written English, knowledge of other languages would be a plus. Taste for challenges and problem solving. More details on our activities can be found on our website. The position is in Bristol, UK. Posted by Julien Nioche.
digitalpebble.blogspot.com
DigitalPebble's Blog: Nutch training course
http://digitalpebble.blogspot.com/2013/07/nutch-training-course.html
Monday, 29 July 2013. We are planning to run a 2-day training course on Apache Nutch on 24/25 October 2013. It will take place in Bristol, UK (the exact venue will be announced later). The course has been put on hold for now. Please do get in touch if you are interested and I will keep you updated as soon as we reach a sufficient number of attendees. Note that the demonstrations and exercises will be based on a Linux OS. The program given here is an indication only and might change slightly. The pr...
digitalpebble.blogspot.com
DigitalPebble's Blog: January 2015
http://digitalpebble.blogspot.com/2015_01_01_archive.html
Wednesday, 28 January 2015. What's new in Storm-Crawler 0.4. We've recently released the version 0.4 of. Which is a collection of resources for building low-latency, large scale web crawlers with. The project has been really active in the last few months, thanks partly to our 2 fantastic new committers (Jake Dodd and Gui Forget) and as a result contains some important changes and improvements. Reorganisation of the code. That can be used to index documents with ElasticSearch. Stream, which is meant to be...