Tika: Extracting ODF

The following program extracts content and metadata from an Open Document Format (ODF) file.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class OpenDocumentParse {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // The handler collects the body text; the Metadata object collects the properties
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext pcontext = new ParseContext();

      // Open Document Parser
      OpenDocumentParser openofficeparser = new OpenDocumentParser();

      // try-with-resources closes the input stream once parsing has finished
      try (FileInputStream inputstream = new FileInputStream(new File("example_open_document_presentation.odp"))) {
         openofficeparser.parse(inputstream, handler, metadata, pcontext);
      }

      System.out.println("Contents of the document:" + handler.toString());
      System.out.println("Metadata of the document:");

      // Print every metadata name together with its value
      String[] metadataNames = metadata.names();
      for (String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

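Save the above code as OpenDocumentParse.java, then compile and run it from the command prompt using the following commands (the Tika library must be on the classpath):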

javac OpenDocumentParse.java
java OpenDocumentParse

A snapshot of example_open_document_presentation.odp is given below:

[Image: Presentation]

This document has the following properties:

[Image: Example2]

After executing the above program, you will get the following output.

Output:

Contents of the document:
Apache Tika
Apache Tika is a framework for content type detection and content extraction which was designed
by Apache software foundation. It detects and extracts metadata and structured text content from
different types of documents such as spreadsheets, text documents, images or PDFs including audio
or video input formats to certain extent.

Metadata of the document:
editing-cycles: 4
meta:creation-date: 2009-04-16T11:32:32.86
dcterms:modified: 2014-09-28T07:46:13.03
meta:save-date: 2014-09-28T07:46:13.03
Last-Modified: 2014-09-28T07:46:13.03
dcterms:created: 2009-04-16T11:32:32.86
date: 2014-09-28T07:46:13.03
modified: 2014-09-28T07:46:13.03
nbObject: 36
Edit-Time: PT32M6S
Creation-Date: 2009-04-16T11:32:32.86
Object-Count: 36
meta:object-count: 36
generator: OpenOffice/4.1.0$Win32 OpenOffice.org_project/410m18$Build-9764
Content-Type: application/vnd.oasis.opendocument.presentation
Last-Save-Date: 2014-09-28T07:46:13.03
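
Tika returns every metadata value as a plain string. As a minimal sketch, and not part of the original program, the values shown above could be converted into typed objects using only the standard java.time classes. The class name MetadataValues and its two helper methods are hypothetical. The sketch assumes that Edit-Time is an ISO-8601 duration and that the date fields are ISO-8601 local timestamps without a zone offset, which matches the sample output but may not hold for every document.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical helper class: converts metadata strings like those printed above
// into java.time objects. It does not depend on Tika.
public class MetadataValues {

    // Parses an ISO-8601 duration such as the Edit-Time value "PT32M6S".
    public static Duration parseEditTime(String value) {
        return Duration.parse(value);
    }

    // Parses a timestamp without a zone offset, such as "2014-09-28T07:46:13.03".
    public static LocalDateTime parseTimestamp(String value) {
        return LocalDateTime.parse(value);
    }

    public static void main(String[] args) {
        // Values copied from the sample output above
        Duration editTime = parseEditTime("PT32M6S");
        LocalDateTime created = parseTimestamp("2009-04-16T11:32:32.86");
        LocalDateTime modified = parseTimestamp("2014-09-28T07:46:13.03");

        System.out.println("Edit time in seconds: " + editTime.getSeconds());
        System.out.println("Created in year: " + created.getYear());
        System.out.println("Modified after creation: " + modified.isAfter(created));
    }
}
```

In a real program the strings would come from metadata.get(name) rather than being hard-coded. Both parse methods throw java.time.format.DateTimeParseException on malformed input, so values from untrusted documents may need a try/catch around them.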