7.6. Load XML - Chapter 7. Export / Import

7.6.1. Load XML Introduction

Many existing (enterprise) applications, endpoints and files use XML as data exchange format.

To make these datastructures available to Cypher, you can use apoc.load.xml. It takes a file or http URL and parses the XML into a map datastructure.

in previous releases we’ve had apoc.load.xmlSimple. This is now deprecated and got superseeded by apoc.load.xml(url, [xPath], [config], true).Simple XML Format

See the following usage-examples for the procedures.

7.6.2. Example File

"How do you access XML doc attributes in children fields ?"

(Thanks Nicolas Rouyer)

For example, if my XML file is the example book.xml provided by Microsoft.

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
...

We have the file here, on GitHub.

7.6.3. Simple XML Format

In a simpler XML representation, each type of children gets it’s own entry within the parent map. The element-type as key is prefixed with "_" to prevent collisions with attributes.

If there is a single element, then the entry will just have that element as value, not a collection. If there is more than one element there will be a list of values.

Each child will still have its _type field to discern them.

Here is the example file from above loaded with apoc.load.xmlSimple

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml", '', {}, true)

{_type: "catalog", _book: [
  {_type: "book", id: "bk101",
    _author: [{_type: "author", _text: "Gambardella, Matthew"},{_type: author, _text: "Arciniegas, Fabio"}],
    _title: {_type: "title", _text: "XML Developer's Guide"},
    _genre: {_type: "genre", _text: "Computer"},
    _price: {_type: "price", _text: "44.95"},
    _publish_date: {_type: "publish_date", _text: "2000-10-01"},
    _description: {_type: description, _text: An in-depth look at creating applications ....

7.6.3.1. Simple XML Examples

Example 1.

WITH "https://maps.googleapis.com/maps/api/directions/xml?origin=Mertens%20en%20Torfsstraat%2046,%202018%20Antwerpen&destination=Rubensstraat%2010,%202300%20Turnhout&sensor=false&mode=bicycling&alternatives=false&key=AIzaSyAPPIXGudOyHD_KAa2f_1l_QVNbsd_pMQs" AS url
CALL apoc.load.xmlSimple(url) YIELD value
RETURN value._route._leg._distance._value, keys(value), keys(value._route), keys(value._route._leg), keys(value._route._leg._distance._value)

Example 2.

WITH "https://maps.googleapis.com/maps/api/directions/xml?origin=Mertens%20en%20Torfsstraat%2046,%202018%20Antwerpen&destination=Rubensstraat%2010,%202300%20Turnhout&sensor=false&mode=bicycling&alternatives=false&key=AIzaSyAPPIXGudOyHD_KAa2f_1l_QVNbsd_pMQs" AS url
CALL apoc.load.xmlSimple(url) YIELD value
UNWIND keys(value) AS key
RETURN key, apoc.meta.type(value[key]);

7.6.3.2. HTTP Headers

You can provide a map of HTTP headers to the config property.

WITH { `X-API-KEY`: 'abc123' } as headers,
WITH "https://myapi.com/api/v1/" AS url
CALL apoc.load.xml(url, '', { headers: headers }) YIELD value
UNWIND keys(value) AS key
RETURN key, apoc.meta.type(value[key]);

7.6.4. xPath

It’s possible to define a xPath (optional) to selecting nodes from the XML document.

7.6.4.1. xPath Example

From the Microsoft’s book.xml file we can get only the books that have as genre Computer

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml", '/catalog/book[genre=\"Computer\"]') yield value as book
WITH book.id as id, [attr IN book._children WHERE attr._type IN ['title','price'] | attr._text] as pairs
RETURN id, pairs[0] as title, pairs[1] as price

In this case we return only id, title and prize but we can return any other elements

We can also return just a single specific element. For example the author of the book with id = bg102

call apoc.load.xml('https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml', '/catalog/book[@id="bk102"]/author') yield value as result
WITH result._text as author
RETURN author

7.6.5. Load XML and Introspect

Let’s just load it and see what it looks like. It’s returned as value map with nested _type and _children fields, per group of elements. Attributes are turned into map-entries. And each element into their own little map with _type, attributes and _children if applicable.

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml")

{_type: catalog, _children: [
  {_type: book, id: bk101, _children: [
    {_type: author, _text: Gambardella, Matthew},
    {_type: title, _text: XML Developer's Guide},
    {_type: genre, _text: Computer},
    {_type: price, _text: 44.95},
    {_type: publish_date, _text: 2000-10-01},
    {_type: description, _text: An in-depth look at creating applications ....

7.6.5.1. For each book, how do I access book id ?

You can access attributes per element directly.

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml") yield value as catalog
UNWIND catalog._children as book
RETURN book.id

╒═══════╕
│book.id│
╞═══════╡
│bk101  │
├───────┤
│bk102  │

7.6.5.2. For each book, how do I access book author and title ?

Filter into collection

You have to filter over the sub-elements in the _childrens array in this case.

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml") yield value as catalog
UNWIND catalog._children as book
RETURN book.id, [attr IN book._children WHERE attr._type IN ['author','title'] | [attr._type, attr._text]] as pairs

╒═══════╤════════════════════════════════════════════════════════════════════════╕
│book.id│pairs                                                                   │
╞═══════╪════════════════════════════════════════════════════════════════════════╡
│bk101  │[[author, Gambardella, Matthew], [title, XML Developer's Guide]]        │
├───────┼────────────────────────────────────────────────────────────────────────┤
│bk102  │[[author, Ralls, Kim], [title, Midnight Rain]]                          │

How do I return collection elements?

This is not too nice, we could also just have returned the values and then grabbed them out of the list, but that relies on element-order.

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml") yield value as catalog
UNWIND catalog._children as book
WITH book.id as id, [attr IN book._children WHERE attr._type IN ['author','title'] | attr._text] as pairs
RETURN id, pairs[0] as author, pairs[1] as title

╒═════╤════════════════════╤══════════════════════════════╕
│id   │author              │title                         │
╞═════╪════════════════════╪══════════════════════════════╡
│bk101│Gambardella, Matthew│XML Developer's Guide         │
├─────┼────────────────────┼──────────────────────────────┤
│bk102│Ralls, Kim          │Midnight Rain                 │

7.6.6. Extracting Datastructures

7.6.6.1. Turn Pairs into Map

So better is to turn them into a map with apoc.map.fromPairs

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml") yield value as catalog
UNWIND catalog._children as book
WITH book.id as id, [attr IN book._children WHERE attr._type IN ['author','title'] | [attr._type, attr._text]] as pairs
CALL apoc.map.fromPairs(pairs) yield value
RETURN id, value

╒═════╤════════════════════════════════════════════════════════════════════╕
│id   │value                                                               │
╞═════╪════════════════════════════════════════════════════════════════════╡
│bk101│{author: Gambardella, Matthew, title: XML Developer's Guide}        │
├─────┼────────────────────────────────────────────────────────────────────┤
│bk102│{author: Ralls, Kim, title: Midnight Rain}                          │
├─────┼────────────────────────────────────────────────────────────────────┤
│bk103│{author: Corets, Eva, title: Maeve Ascendant}                       │

Return individual Columns

And now we can cleanly access the attributes from the map.

call apoc.load.xml("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml") yield value as catalog
UNWIND catalog._children as book
WITH book.id as id, [attr IN book._children WHERE attr._type IN ['author','title'] | [attr._type, attr._text]] as pairs
CALL apoc.map.fromPairs(pairs) yield value
RETURN id, value.author, value.title

╒═════╤════════════════════╤══════════════════════════════╕
│id   │value.author        │value.title                   │
╞═════╪════════════════════╪══════════════════════════════╡
│bk101│Gambardella, Matthew│XML Developer's Guide         │
├─────┼────────────────────┼──────────────────────────────┤
│bk102│Ralls, Kim          │Midnight Rain                 │
├─────┼────────────────────┼──────────────────────────────┤
│bk103│Corets, Eva         │Maeve Ascendant               │

7.6.7. import xml directly

In case you don’t want to transform your xml (like you do with apoc.load.xml/apoc.load.xmlSimple before you create nodes and relationships and you want to have a 1:1 mapping of xml into the graph you can use apoc.xml.import.

7.6.7.1. usage

CALL apoc.xml.import(<url>, <config-map>?) YIELD node

The procedure will return a node representing the xml document containing nodes/rels underneath mapping to the xml structure. The following mapping rules are applied:

xml	label	properties
document	XmlDocument	_xmlVersion, _xmlEncoding
processing instruction	XmlProcessingInstruction	_piData, _piTarget
Element/Tag	XmlTag	_name
Attribute	n/a	property in the XmlTag node
Text	XmlWord	for each word a separate node is created

The nodes for the xml document are connected:

relationship type	description
:IS_CHILD_OF	pointing to a nested xml element
:FIRST_CHILD_OF	pointing to the first child
:NEXT_SIBLING	pointing to the next xml element on the same nesting level
:NEXT	produces a linear chain through the full document
:NEXT_WORD	only produced if config map has `createNextWordRelationships:true`. Connects words in xml to a text flow.

The following options are available for the config map:

config option	default value	description
connectCharacters	false	if `true` the xml text elements are child nodes of their tags, interconnected by relationships of type `relType` (see below)
filterLeadingWhitespace	false	if `true` leading whitespace is skipped for each line
delimiter	`\s` (regex whitespace)	if given, split text elements with the delimiter into separate nodes
label	XmlCharacter	label to use for text element representation
relType	`NE`	relationship type to be used for connecting the text elements into one linked list
charactersForTag	{}	map of tagname → string. For the given tag names an additional text element is added containing the value as `text` property. Useful e.g. for `<lb/>` tags in TEI-XML to be represented as `<lb> </lb>`.

7.6.7.2. example

call
apoc.xml.import("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/3.4/src/test/resources/xml/books.xml",{createNextWordRelationships:
true})
yield node
return node;

call apoc.xml.import('https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1', {connectCharacters: true, charactersForTag:{lb:' '}, filterLeadingWhitespace: true}) yield node
return node;

7.6.9. Config

Config param is optional, the default value is an empty map.

`charset`	Default: UTF-8
`baserUri`	Default: "", it is use to resolve relative paths

7.6.10. Example with real data

For these example we use the wikipedia home page "https://en.wikipedia.org/"

7.6.10.1. Examples

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"})

You will get this result:

CALL apoc.load.html("https://en.wikipedia.org/",{links:"link"})

You will get this result:

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"}, {charset: "UTF-8})

You will get this result:

7.6.12. Example

You can export your graph as an unweighted network.

match path = (:Person)-[:ACTED_IN]->(:Movie)
WITH path LIMIT 1000
with collect(path) as paths
call apoc.gephi.add(null,'workspace0', paths) yield nodes, relationships, time
return nodes, relationships, time

You can export your graph as a weighted network, by specifying the property of a relationship, that holds the weight value.

match path = (:Person)-[r:ACTED_IN]->(:Movie) where exists r.weightproperty
WITH path LIMIT 1000
with collect(path) as paths
call apoc.gephi.add(null,'workspace0', paths, 'weightproperty') yield nodes, relationships, time
return nodes, relationships, time

You can also export with your graph other properties of your nodes and/or relationship by adding an optional array with the property names you want to export. Example for exporting birthYear and role property.

match path = (:Person)-[r:ACTED_IN]->(:Movie) where exists r.weightproperty
WITH path LIMIT 1000
with collect(path) as paths
call apoc.gephi.add(null,'workspace0', paths, 'weightproperty',['birthYear', 'role']) yield nodes, relationships, time
return nodes, relationships, time

7.6.12.1. Format

We send all nodes and relationships of the passed in data convert into individual Gephi-Streaming JSON fragements, separated by \r\n.

{"an":{"123":{"TYPE":"Person:Actor","label":"Tom Hanks",                           x:333,y:222,r:0.1,g:0.3,b:0.5}}}\r\n
{"an":{"345":{"TYPE":"Movie","label":"Forrest Gump",                               x:234,y:122,r:0.2,g:0.2,b:0.7}}}\r\n
{"ae":{"3344":{"TYPE":"ACTED_IN","label":"Tom Hanks",source:"123",target:"345","directed":true,"weight":1.0,r:0.1,g:0.3,b:0.5}}}

7.6.13. Specifics Details

Gephi doesn’t render the graph data unless you also provide x,y coordinates in the payload, so we just send random ones within a 1000x1000 grid.

We also generate colors per label combination and relationship-type, both of which are also transferred as TYPE property.

You can have your weight property stored as a number (integer,float) or a string. If the weight property is invalid or null, it will use the default 1.0 value.

7.6.14. Load Single File From Compressed File (zip/tar/tar.gz/tgz)

For load file from compressed file the url right syntax is:

apoc.load.csv("pathToFile!csv/fileName.csv")

apoc.load.json("https://github.com/neo4j-contrib/neo4j-apoc-procedures/tree/3.4/src/test/resources/testload.tgz?raw=true!person.json");

You have to put the ! character before the filename.

7.6.15. Using S3 protocol

For using S3 protocol you have to copy these jars into the plugins directory:

aws-java-sdk-core-1.11.250.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-core/1.11.250)
aws-java-sdk-s3-1.11.250.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3/1.11.250)
httpclient-4.4.8.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.4)
httpcore-4.5.4.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.8)
joda-time-2.9.9.jar (https://mvnrepository.com/artifact/joda-time/joda-time/2.9.9)

S3 Url must be:

s3://accessKey:secretKey@endpoint:port/bucket/key or
s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey

7.6.16. failOnError

Adding on config the parameter failOnError:false (by default true), in case of error the procedure don’t fail but just return zero rows.