We have developed our own web scraper technology as part of our portal solutions. It has been running in our portals in real-time mode for a long time, and we can certainly apply it to your tasks.
Because the technology is already in place, we do not need to spend time writing it from scratch; all we need to do is write plugins for the source sites and for each output format.
This drastically reduces the project timeline and the cost.
The scraper can collect various kinds of data and store them in a customized target format (such as CSV or Excel).
The technology is mature and highly customizable. It has its own database with a history of already scraped data, which makes it possible to reconcile data, review the history of what was scraped and the version history of each spidered entity, check whether similar data already exists, and so on.
New sources are easy to add: for each new source a programmer writes a plugin, so the scraper can be extended easily, rapidly, and cost-effectively to support new data sources. For example, a news scraper can take news from different sites and aggregate them into the same output format (database, Excel, or CSV).
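As an illustration only, here is a minimal sketch of what such a source plugin could look like; the class, method, and field names are assumptions for this proposal and do not reflect the actual code of our scraper.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator


class SourcePlugin(ABC):
    """One plugin per source site; every plugin yields records in a shared format."""

    @abstractmethod
    def fetch(self) -> Iterator[Dict[str, str]]:
        """Yield normalized records scraped from this source."""


class ExampleNewsPlugin(SourcePlugin):
    """Hypothetical plugin for a single news site."""

    def fetch(self) -> Iterator[Dict[str, str]]:
        # A real plugin would download and parse the site's pages here;
        # a hard-coded record just shows the common output format.
        yield {"title": "Sample headline", "body": "Sample text", "url": "http://example.com/news/1"}


def aggregate(plugins: Iterable[SourcePlugin]) -> Iterator[Dict[str, str]]:
    """Merge records from all registered source plugins into one stream."""
    for plugin in plugins:
        yield from plugin.fetch()
```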
You define how the data is stored. Data can be written to multiple locations in different formats, and data from all inputs is merged together into one output format.
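Again purely as a sketch (using only the Python standard library and hypothetical field names), this is the kind of output writer that would store the merged stream in one of the supported formats, CSV in this case.

```python
import csv
from typing import Dict, Iterable


def write_csv(records: Iterable[Dict[str, str]], path: str) -> None:
    """Write the merged records from all sources into a single CSV file."""
    fieldnames = ["title", "body", "url"]  # hypothetical shared fields
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in records:
            writer.writerow(record)


# An Excel or database writer would be a second output plugin with the same
# interface, so the same data can be stored to multiple locations at once.
```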
For each entity you scrape, you can define attributes via the UI for further per-attribute configuration (for example, in the blacklist).
Blacklist rules specify which source data the scraper should omit. They can be configured for each field of the source data: if an attribute of a source entity contains blacklisted data for that attribute, the entire entity is skipped.
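A minimal sketch of how such per-attribute blacklist filtering works in principle; the rule format shown here is an assumption, since in the real scraper the rules are configured through the web UI.

```python
from typing import Dict, Iterable, Iterator, List


def apply_blacklist(records: Iterable[Dict[str, str]],
                    blacklist: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Skip any entity whose attribute contains a blacklisted value for that attribute."""
    for record in records:
        blocked = any(
            bad.lower() in record.get(field, "").lower()
            for field, bad_values in blacklist.items()
            for bad in bad_values
        )
        if not blocked:
            yield record


# Example: drop any news item whose title mentions "advertisement".
kept = list(apply_blacklist([{"title": "Advertisement: buy now"},
                             {"title": "Market news"}],
                            blacklist={"title": ["advertisement"]}))
# kept == [{"title": "Market news"}]
```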
Checking of previously uploaded data: if a source entity is updated, the corresponding target entity is overwritten (no new row is added; the existing one is replaced). If a source entity is deleted, the target entity is deleted as well. New source entities are added to the target dataset.
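Conceptually, this update check boils down to the following simplified sketch; in the real scraper this state lives in its own database, keyed by each entity's identifier.

```python
from typing import Dict


def reconcile(target: Dict[str, dict], scraped: Dict[str, dict]) -> None:
    """Bring the target dataset in line with the freshly scraped source entities."""
    for key, row in scraped.items():
        target[key] = row            # updated rows are overwritten, new rows are added
    for key in list(target):
        if key not in scraped:
            del target[key]          # rows that disappeared from the source are deleted
```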
A version history of each row is stored when updates occur, and a web interface for browsing that history is available (it works when the update-checking capability is enabled).
The history of what was scraped and when can be stored or omitted.
If text data is identical or similar (above a defined similarity level), the new source data can be skipped, and this is recorded in a log.
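Here is a sketch of such a similarity check using the Python standard library's SequenceMatcher; the 0.9 threshold and the log message are illustrative assumptions, not the scraper's actual defaults.

```python
import logging
from difflib import SequenceMatcher

logging.basicConfig(level=logging.INFO)


def is_near_duplicate(new_text: str, existing_text: str, threshold: float = 0.9) -> bool:
    """Return True (and log it) when the texts are equal or similar enough to skip."""
    ratio = SequenceMatcher(None, new_text, existing_text).ratio()
    if ratio >= threshold:
        logging.info("Skipping similar source item (similarity %.2f)", ratio)
        return True
    return False
```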
A web interface lets you set start parameters, define the schedule and blacklists, select modes, view version history and operations history, and so on. It supports a set of scrapers working in parallel.
The scraper schedule is configurable: it can run every week or according to your requirements.
You can also run the scraper manually or re-scrape a selected entity from the scraping history in the UI.
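To give an idea of what the web interface controls, here is a hypothetical configuration that a single scraper instance might run with; all field names and values are illustrative only.

```python
# Hypothetical per-scraper configuration as it might be set via the web UI.
scraper_config = {
    "name": "news_scraper",
    "schedule": "weekly",             # or any other interval your requirements call for
    "mode": "check_updates",          # overwrite / delete / add behaviour described above
    "keep_version_history": True,
    "keep_operations_history": True,
    "similarity_threshold": 0.9,
    "blacklist": {"title": ["advertisement"]},
    "outputs": ["db", "csv"],
    "parallel_workers": 4,            # several scrapers can run in parallel
}
```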
We also have technology for automatic classification of content. You define a dictionary of terms (including multi-word terms), and the input data is automatically classified against it; related terms can even be selected by tags.
For example, on our medical portal, news items are crawled periodically. After crawling they are classified, and the matched terms become hyperlinks to the corresponding term pages.
Here is the list of terms in the terms admin tool:
You can see the hyperlinks in the text; they are created automatically. The news item was also automatically classified as related to «Атеросклероз» (Russian for "atherosclerosis"), and a link to this term's page was created below the news text. If you follow that link, the term page contains a back-reference to this news item.
You can also see that we classify phrases, not only single words: note the two-word link «рыбий жир» ("fish oil").
Not all of the links are repeated below the news text. This is because of the term types: we have terms of type "disease" and "substance", and the processing is defined differently depending on the type of the term.
In short, we can classify crawled content automatically and perform any actions on the classified content.
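For illustration, here is a minimal sketch of dictionary-based classification with multi-word terms and automatic hyperlinking; the term list and URL paths are made-up examples and not the portal's actual code. Matching longer terms first keeps multi-word terms such as «рыбий жир» intact.

```python
import re
from typing import Dict, List


def classify(text: str, terms: Dict[str, str]) -> List[str]:
    """Return the dictionary terms (including multi-word ones) found in the text."""
    return [term for term in terms
            if re.search(re.escape(term), text, flags=re.IGNORECASE)]


def hyperlink(text: str, terms: Dict[str, str]) -> str:
    """Replace every matched term with a link to its term page, longest terms first."""
    for term, url in sorted(terms.items(), key=lambda kv: len(kv[0]), reverse=True):
        text = re.sub(re.escape(term),
                      lambda m: f'<a href="{url}">{m.group(0)}</a>',
                      text, flags=re.IGNORECASE)
    return text


terms = {"атеросклероз": "/terms/atherosclerosis", "рыбий жир": "/terms/fish-oil"}
print(classify("Атеросклероз и рыбий жир.", terms))
print(hyperlink("Атеросклероз и рыбий жир.", terms))
```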
Portal URL: http://www.med-life.com/
This is a medical portal with news and medical information. News and medical instructions are scraped by two separate scrapers, since they are different types of entities.
You can see the instructions scraped by this scraper at http://www.med-life.com/drugalphadir
This scraper is of the alphabetic-catalog type: it is intended for scraping alphabetic catalogs (as opposed to the paged-list sources supported by the previous scraper) and can refresh the data letter by letter.
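The idea behind refreshing by letter is simply that each letter maps to its own catalog page, so one letter can be re-scraped without touching the rest; a sketch with a made-up URL pattern follows.

```python
import string
from typing import Iterator


def catalog_pages(base_url: str, letters: str = string.ascii_uppercase) -> Iterator[str]:
    """Yield one catalog page URL per letter so individual letters can be refreshed."""
    for letter in letters:
        yield f"{base_url}?letter={letter}"


# Refresh only the catalog entries starting with "B":
urls = list(catalog_pages("http://example.com/catalog", letters="B"))
```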
Portal URL: http://www.inautoclub.net/
Here you can see:
For each source site we can estimate the required time before we start, so you will know the cost in advance.